WO2021189981A1

WO2021189981A1 - Voice noise processing method and apparatus, and computer device and storage medium

Info

Publication number: WO2021189981A1
Application number: PCT/CN2020/136367
Authority: WO
Inventors: 罗剑; 王健宗; 程宁
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-10-26
Filing date: 2020-12-15
Publication date: 2021-09-30
Also published as: CN112201270B; CN112201270A

Abstract

A voice noise processing method and apparatus, and a computer device and a storage medium, which relate to the field of artificial intelligence. The method comprises: acquiring a voice sequence to be recognized (101); performing noise recognition on the voice sequence, and if the voice sequence includes voice noise, determining, by using a preset noise classification model, a noise category corresponding to the voice noise, wherein the noise classification model is obtained through joint training with a plurality of noise generation models, and different noise generation models generate different categories of voice noise (102); and on the basis of the noise category, determining an optimal noise reduction processing strategy corresponding to the voice noise, and performing noise reduction processing on the voice noise by using the optimal noise reduction processing strategy (103). The categories of voice noise in different scenarios can thus be recognized, and the voice noise is processed in a noise reduction manner appropriate to the recognized noise category, so as to achieve an optimal noise reduction effect.

Description

Speech noise processing method, device, computer equipment and storage medium

This application claims priority to Chinese patent application No. 202011153509.1, filed with the Chinese Patent Office on October 26, 2020 and entitled "Speech noise processing methods, devices, computer equipment, and storage media", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method, device, computer equipment, and storage medium for processing speech noise.

Background technique

In speech recognition technology, it is usually necessary to recognize the noise in the speech sequence and perform noise reduction processing on the recognized noise to improve the accuracy of subsequent speech recognition. Therefore, it is very important to effectively process the speech noise.

At present, when voice noise is processed, the voice noise is usually recognized first, and a unified noise reduction processing method is then applied to it. However, the inventor realized that this approach cannot identify the type of the speech noise. The types of speech noise differ between scenarios, so if the same noise reduction processing method is applied to speech noise in different scenarios, the achievable noise reduction effect is limited; that is, the optimal noise reduction effect cannot be achieved in every scenario.

Technical problem

This application provides a method, device, computer equipment, and storage medium for processing speech noise, whose main purpose is to identify the types of speech noise in different scenarios and to process the noise with a noise reduction method appropriate to the recognized noise type, so as to achieve the optimal noise reduction effect.

Technical solutions

According to the first aspect of the present application, a method for processing speech noise is provided, including:

Obtain the voice sequence to be recognized;

Perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained through joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different;

Based on the noise category, an optimal noise reduction processing strategy corresponding to the speech noise is determined, and the optimal noise reduction processing strategy is used to perform noise reduction processing on the speech noise.

According to a second aspect of the present application, there is provided a speech noise processing device, including:

The acquiring unit is used to acquire the voice sequence to be recognized;

The determining unit is configured to perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained through joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different;

The noise reduction unit is configured to determine an optimal noise reduction processing strategy corresponding to the speech noise based on the noise category, and use the optimal noise reduction processing strategy to perform noise reduction processing on the speech noise.

According to a third aspect of the present application, there is provided a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of a method for processing speech noise are realized:

Obtain the voice sequence to be recognized;

According to a fourth aspect of the present application, there is provided a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the processor, when executing the program, implements the steps of a method for processing speech noise:

Obtain the voice sequence to be recognized;

Beneficial effect

Compared with the current approach of applying the same noise reduction strategy to different types of speech noise, the speech noise processing method, device, computer equipment, and storage medium provided in this application can obtain the voice sequence to be recognized and perform noise recognition on the voice sequence; if the voice sequence contains voice noise, a preset noise classification model is used to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained through joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different; at the same time, based on the noise category, the optimal noise reduction processing strategy corresponding to the speech noise is determined and used to perform noise reduction processing on the speech noise. Because the noise classification model is jointly trained with multiple noise generation models, it can identify the types of speech noise in different scenarios, so that the optimal noise reduction processing strategy can be selected according to the determined noise category and the optimal noise reduction effect can be achieved.

Description of the drawings

The drawings described here are used to provide a further understanding of the application and constitute a part of the application. The exemplary embodiments and descriptions of the application are used to explain the application, and do not constitute an improper limitation of the application. In the drawings:

FIG. 1 shows a flowchart of a method for processing speech noise provided by an embodiment of the present application;

FIG. 2 shows a flowchart of another method for processing voice noise according to an embodiment of the present application;

FIG. 3 shows a schematic structural diagram of a speech noise processing apparatus provided by an embodiment of the present application;

FIG. 4 shows a schematic structural diagram of another apparatus for processing speech noise according to an embodiment of the present application;

FIG. 5 shows a schematic diagram of the physical structure of a computer device provided by an embodiment of the present application.

The best mode of the present invention

Hereinafter, the present application will be described in detail with reference to the drawings and in conjunction with the embodiments. It should be noted that the embodiments in the application and the features in the embodiments can be combined with each other if there is no conflict.

At present, when voice noise is processed, the voice noise is usually recognized first, and a unified noise reduction processing method is then applied to it. However, this approach cannot identify the type of the speech noise. The types of speech noise differ between scenarios, so if the same noise reduction processing method is applied to speech noise in different scenarios, the achievable noise reduction effect is limited; that is, the optimal noise reduction effect cannot be achieved in every scenario.

In order to solve the foregoing problem, an embodiment of the present application provides a method for processing speech noise. As shown in FIG. 1, the method includes:

101. Acquire a voice sequence to be recognized.

The voice sequence to be recognized is a user voice sequence obtained from a certain scene, for example a user voice sequence collected on the side of a street or in a factory. The voice sequence may or may not contain voice noise. In this embodiment, in order to improve the accuracy of recognizing the user's speech, it is necessary to determine whether the collected voice sequence contains voice noise; if it does, the voice sequence needs to be denoised. In the noise reduction process, an appropriate noise reduction strategy can be selected according to the type of speech noise so as to achieve the optimal noise reduction effect. The embodiments of the present application are mainly applicable to the processing of speech noise. The execution subject of the embodiments is an apparatus or device capable of processing speech noise, which can be set on the client side or the server side.

Specifically, after a user's speech sequence is obtained in a certain scene and before judging whether it contains speech noise, the obtained speech sequence needs to be preprocessed, including pre-emphasis, framing, and windowing. The preprocessed speech sequence is used as the speech sequence to be recognized, and it is then determined whether this sequence contains speech noise. If it does not, speech recognition is performed on it directly; if it does, the type of the noise needs to be further determined, so that an appropriate noise reduction processing strategy can be selected according to the determined type and the best noise reduction effect achieved.
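The three preprocessing operations named above can be sketched as follows. This is a minimal illustration only: the pre-emphasis coefficient 0.97, the frame length of 400 samples, the hop of 160 samples, and the choice of a Hamming window are common defaults assumed here, not values given in this application, and the function names are hypothetical.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha = 0.97 is an assumed default."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop_len):
    """Split the signal into overlapping frames, dropping any incomplete tail."""
    if len(signal) < frame_len:
        return []
    count = 1 + (len(signal) - frame_len) // hop_len
    return [signal[i * hop_len: i * hop_len + frame_len] for i in range(count)]

def hamming_window(length):
    """Hamming coefficients w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def preprocess(signal, frame_len=400, hop_len=160, alpha=0.97):
    """Pre-emphasis, then framing, then windowing of every frame."""
    emphasized = pre_emphasis(signal, alpha)
    window = hamming_window(frame_len)
    return [[s * w for s, w in zip(frame, window)]
            for frame in frame_signal(emphasized, frame_len, hop_len)]
```

With a 16 kHz recording, the assumed defaults correspond to 25 ms frames taken every 10 ms; the resulting list of windowed frames would be what is passed on for feature extraction.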

102. Perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine a noise category corresponding to the voice noise.

The noise classification model is obtained through joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different. In addition, the types of speech noise differ between scenarios; for example, the speech noise collected on the side of a street is of a different type from that collected in a factory. In this embodiment, in order to determine whether the speech sequence to be recognized contains speech noise, the sequence is input into a preset noise recognition model for noise recognition. The preset noise recognition model may specifically be a first preset neural network model. When the first preset neural network model is used to recognize speech noise, its hidden layer extracts the voice features corresponding to the voice sequence to be recognized, and whether the sequence contains voice noise is determined from the extracted features. If the sequence does not contain voice noise, speech recognition is performed directly on the extracted voice features; if it does, the extracted features are input into a preset noise classification model for noise classification. The noise classification model may be a second preset neural network model. During noise classification, the hidden layer in the second preset neural network model extracts the noise features corresponding to the voice noise, and the noise type of the voice noise contained in the sequence is determined from the extracted noise features, so that a suitable noise reduction processing strategy can be selected according to the determined noise type to achieve the optimal noise reduction effect in that scene.

Different types of speech noise call for different optimal noise reduction processing strategies. For example, speech noise from the side of a street is relatively random and has a wide spectrum range, so an adaptive filter can be used for noise reduction; speech noise from a factory is mostly machine processing noise in the workshop, which has little randomness and a narrow spectrum range, so an adaptive notch filter can be used. In this embodiment, according to the determined noise category, the noise reduction processing strategy corresponding to that category is selected from a preset noise reduction strategy library and taken as the optimal noise reduction processing strategy, which is then used to reduce the speech noise in the speech sequence to be recognized. In this way the optimal noise reduction effect can be achieved for speech noise in different scenarios, and the degradation in noise reduction effect caused by a uniform noise reduction strategy is avoided.
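To illustrate why a notch filter suits narrow-band machine noise, the following sketch implements a fixed second-order notch. It is a simplified stand-in and not the adaptive notch filter named above: the notch frequency is supplied by the caller rather than tracked automatically, and the function name, the pole radius of 0.95, and the example frequencies are all assumptions made for illustration.

```python
import math

def notch_filter(samples, freq_hz, sample_rate_hz, pole_radius=0.95):
    """Fixed second-order IIR notch (illustrative stand-in, not adaptive).

    Zeros sit on the unit circle at the target frequency and poles sit just
    inside it at the same angle, so only a narrow band is removed.
    """
    w0 = 2.0 * math.pi * freq_hz / sample_rate_hz
    b1 = -2.0 * math.cos(w0)                 # numerator: 1 + b1*z^-1 + z^-2
    a1 = -2.0 * pole_radius * math.cos(w0)   # denominator: 1 + a1*z^-1 + a2*z^-2
    a2 = pole_radius ** 2
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in samples:
        y0 = x0 + b1 * x1 + x2 - a1 * y1 - a2 * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

A steady tone at the notch frequency is suppressed once the start-up transient has decayed, while components far from that frequency pass through almost unchanged. An adaptive variant would additionally update the notch frequency from the signal itself.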

Compared with the current approach of applying the same noise reduction strategy to different types of speech noise, the method for processing speech noise provided by this embodiment of the present application can obtain the voice sequence to be recognized and perform noise recognition on the voice sequence; if the voice sequence contains speech noise, a preset noise classification model is used to determine the noise category corresponding to the speech noise, wherein the noise classification model is obtained through joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different; at the same time, based on the noise category, the optimal noise reduction processing strategy corresponding to the speech noise is determined and used to perform noise reduction processing on the speech noise. Because the noise classification model is jointly trained with multiple noise generation models, it can recognize the types of speech noise in different scenarios, so that the optimal noise reduction processing strategy can be selected according to the determined noise category and the optimal noise reduction effect can be achieved.

Further, in order to better explain the above-mentioned speech noise processing process, as a refinement and extension of the above-mentioned embodiment, an embodiment of the present application provides another method for processing speech noise. As shown in FIG. 2, the method includes:

201. Obtain a real voice sequence and a plurality of random voice sequences in a preset voice sample library, and perform clustering processing on the real voice sequence to obtain real voice sequences in different clustering categories.

The multiple random speech sequences may obey a Gaussian distribution. The real speech sequences are real user speech sequences collected in different scenes; they have already undergone noise reduction and contain no noise, so speech recognition can be performed on them directly. In this embodiment, the multiple random voice sequences and the multiple noise generation models are used to imitate the real voice sequences of users in different scenarios, thereby generating voice noise for the different scenarios. A noise recognition model and a noise classification model are then constructed from the generated voice noise and the real speech sequences in the different scenarios, so that speech noise can be recognized and classified.

In this embodiment, the real voice sequences of users in the preset sample library are obtained. These real voice sequences come from different scenarios. In order to use the real voice sequences in different scenarios together with the random voice sequences to construct a noise recognition model and a noise classification model, the real speech sequences in the preset sample library first need to be clustered. Based on this, step 201 specifically includes: calculating the Euclidean distance between different real speech sequences according to a preset Euclidean distance algorithm; and clustering the real speech sequences based on the Euclidean distance to obtain real speech sequences in different clustering categories. Because voice sequences collected in the same scenario are relatively similar to one another, clustering the voice sequences in the preset sample library yields the real voice sequences under different clustering categories; the scene corresponding to each clustering category is then determined, which gives the real voice sequences in different scenarios.

Specifically, the Euclidean distance between different real speech sequences is calculated according to the preset Euclidean distance algorithm, and the real speech sequences are clustered according to the calculated distances to obtain real speech sequences in different clustering categories. The voice features corresponding to the real voice sequences under each clustering category are then extracted, and the scene corresponding to each category is determined from them. For example, it is determined that real voice sequences 1-10 were collected on the street and real voice sequences 11-20 were collected in a factory, so that the real voice sequences in different scenarios are determined.
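The distance-based clustering of step 201 can be sketched as follows. The application does not name a particular clustering algorithm, so plain k-means is assumed here for illustration; it is also assumed that each real speech sequence has already been reduced to a fixed-length feature vector, since the Euclidean distance requires vectors of equal length. The function names and the seeding rule are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=20):
    """Plain k-means (an assumed choice of clustering algorithm).

    The first k vectors seed the centroids, which is an arbitrary but
    deterministic initialization used only to keep the sketch short.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Each resulting label corresponds to one clustering category; mapping a category to a scene such as "street" or "factory" is a separate step, as described above.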

202. Construct the noise classification model and the multiple noise generation models according to the multiple random voice sequences and the real voice sequences in the different clustering categories.

In this embodiment, in order to construct the noise classification model and the multiple noise generation models, step 202 specifically includes: constructing an initial noise classification model and multiple initial noise generation models respectively; and, according to the multiple random voice sequences and the real speech sequences in the different clustering categories, performing joint iterative training on the initial noise classification model and the multiple initial noise generation models to construct the noise classification model and the multiple noise generation models. Further, in order to be able to recognize speech noise, a noise recognition model also needs to be constructed. In that case, constructing an initial noise classification model and multiple initial noise generation models respectively includes: constructing an initial noise recognition model, an initial noise classification model, and multiple initial noise generation models respectively.

Based on this, performing joint iterative training on the initial noise classification model and the multiple initial noise generation models according to the multiple random voice sequences and the real voice sequences in the different clustering categories, to construct the noise classification model and the multiple noise generation models, includes: inputting the multiple random speech sequences into the multiple initial noise generation models respectively to generate different types of speech noise; inputting the generated speech noise and the real speech sequences respectively into the initial noise recognition model for noise recognition to obtain an initial noise recognition result; extracting the speech features corresponding to the speech noise in the initial noise recognition result and inputting them into the initial noise classification model for noise classification to obtain an initial noise classification result; constructing a noise recognition accuracy loss function and a noise classification accuracy loss function based on the initial noise recognition result and the initial noise classification result respectively; and, according to the noise recognition accuracy loss function and the noise classification accuracy loss function, performing joint iterative training on the initial noise recognition model, the initial noise classification model, and the multiple initial noise generation models to construct the noise recognition model, the noise classification model, and the multiple noise generation models. The preset noise generation models use a convolutional neural network.

Specifically, the different types of speech noise and the real speech sequences under the different clustering categories are input into the initial noise recognition model for noise recognition to obtain the initial noise recognition result. The speech features corresponding to the speech noise in the initial noise recognition result are then extracted and input into the preset initial noise classification model for noise classification to obtain the noise classification result. According to the noise classification result and the noise recognition result, the noise recognition accuracy loss function and the noise classification accuracy loss function are constructed respectively. The specific formula is as follows:

Here, Ls is the noise recognition accuracy loss function, Lc is the noise classification accuracy loss function, zi is speech noise, xi is the real speech sequence, D stands for the preset noise recognition model, G stands for the preset noise generation model, and c stands for the noise classification model. In order to bring the speech noise generated by the noise generation model closer to the real speech sequence and make recognition harder for the noise recognition model, the noise generation model and the noise recognition model are optimized in opposite directions: the noise generation model needs to minimize the accuracy of the preset noise recognition model, so its optimization direction is to minimize Lc-Ls, while the training purpose of the noise classification model is to maximize the accuracy of noise classification, so its optimization direction is to maximize Lc+Ls. With these two optimization objectives, the initial noise generation model, the initial noise recognition model, and the initial noise classification model can be trained continuously to construct the noise generation model, the noise recognition model, and the noise classification model.
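The loss formula itself is not reproduced in this text. As a rough illustration of what two such terms can look like, the sketch below computes a recognition term and a classification term as mean log-likelihoods, in the style of auxiliary-classifier GAN objectives. This form is an assumption and not the formula of this application; the sign convention (a larger value means higher accuracy) is chosen only to agree with the statement that the classification model's direction is to maximize Lc+Ls, and all function and parameter names are hypothetical.

```python
import math

def recognition_term(p_clean_given_real, p_noise_given_generated):
    """Assumed form of Ls: mean log-probability that the recognition model
    labels real sequences x_i as clean and generated samples as noise."""
    real = sum(math.log(p) for p in p_clean_given_real) / len(p_clean_given_real)
    fake = sum(math.log(p) for p in p_noise_given_generated) / len(p_noise_given_generated)
    return real + fake

def classification_term(class_probs, true_classes):
    """Assumed form of Lc: mean log-probability that the classification
    model assigns to the correct noise category of each sample."""
    total = sum(math.log(probs[c]) for probs, c in zip(class_probs, true_classes))
    return total / len(true_classes)

# Toy values: each number is the probability a model gave to the correct label.
ls = recognition_term([0.9, 0.8], [0.7, 0.6])
lc = classification_term([[0.8, 0.2], [0.3, 0.7]], [0, 1])
```

Under this assumed convention both terms are at most zero and reach zero only when every correct label receives probability one, so raising their sum corresponds to more accurate recognition and classification.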

203. Acquire a voice sequence to be recognized.

The voice sequence to be recognized is a user voice sequence obtained from a certain scene, and it may or may not contain voice noise. In order to ensure the quality of subsequent speech recognition, if the voice sequence to be recognized contains speech noise, that noise needs to be reduced. To improve the noise reduction effect, the type of the speech noise can be further identified, so that a noise reduction processing strategy appropriate to that type can be selected.

204. Perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine a noise category corresponding to the voice noise.

Wherein, the noise classification model is obtained through joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different. In this embodiment, in order to determine the noise type corresponding to the speech noise, step 204 specifically includes: performing speech feature extraction on the speech sequence to obtain the speech features corresponding to the speech sequence; judging, based on the speech features, whether the speech sequence contains speech noise; and, if it does, using the noise classification model to determine the noise category corresponding to the speech noise based on the extracted speech features.

Specifically, the voice sequence to be recognized is input into the noise recognition model for noise recognition. During noise recognition, the hidden layer in the preset noise recognition model extracts the voice features corresponding to the voice sequence to be recognized, and whether the sequence contains speech noise is determined from the extracted features. If it does, the extracted features are input into the noise classification model for noise classification to determine the noise category corresponding to the speech noise.

205. Determine an optimal noise reduction processing strategy corresponding to the speech noise based on the noise category, and perform noise reduction processing on the speech noise by using the optimal noise reduction processing strategy.

In this embodiment, according to the determined noise category, the noise reduction processing strategy corresponding to that category is selected from the preset noise reduction strategy library and taken as the optimal noise reduction processing strategy, which is then used to perform noise reduction processing on the speech noise in the speech sequence to be recognized. In this way the optimal noise reduction effect can be achieved for speech noise in different scenarios, and the degradation in noise reduction effect caused by a unified noise reduction strategy is avoided.
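The control flow of steps 204 and 205 (extract features, check for noise, classify, look up a strategy, apply it) can be sketched as follows. Everything here is a hypothetical stand-in: the real recognition and classification models are neural networks, whereas the toy functions below use a simple peak threshold, and the two entries of the strategy library are placeholders rather than actual adaptive filters.

```python
def denoise(sequence, extract_features, contains_noise, classify_noise,
            strategy_library, default_strategy=None):
    """Return (processed_sequence, noise_category); category is None if clean."""
    features = extract_features(sequence)
    if not contains_noise(features):
        return sequence, None                      # no noise: pass through
    category = classify_noise(features)
    strategy = strategy_library.get(category, default_strategy)
    if strategy is None:
        raise KeyError(f"no noise reduction strategy for category {category!r}")
    return strategy(sequence), category

# Toy stand-ins (hypothetical, for demonstration only).
def toy_features(seq):
    return {"peak": max(abs(s) for s in seq)}

def toy_contains_noise(features):
    return features["peak"] > 1.0

def toy_classify(features):
    return "factory" if features["peak"] > 5.0 else "street"

toy_library = {
    "street": lambda seq: [max(-1.0, min(1.0, s)) for s in seq],  # placeholder
    "factory": lambda seq: [s / 10 for s in seq],                 # placeholder
}

result, category = denoise([0.5, 2.0], toy_features, toy_contains_noise,
                           toy_classify, toy_library)
```

Keeping the strategies in a dictionary keyed by noise category means that a new scenario can be supported by registering one more entry, without changing the recognition or classification steps.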

Compared with the current approach of applying the same noise reduction strategy to different types of speech noise, the other method for processing speech noise provided by this embodiment of the present application can obtain the voice sequence to be recognized and perform noise recognition on the voice sequence; if the voice sequence contains voice noise, a preset noise classification model is used to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained through joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different; at the same time, based on the noise category, the optimal noise reduction processing strategy corresponding to the speech noise is determined and used to perform noise reduction processing on the speech noise. Because the noise classification model is jointly trained with multiple noise generation models, it can identify the types of speech noise in different scenarios, so that the optimal noise reduction processing strategy can be selected according to the determined noise category and the optimal noise reduction effect can be achieved.

Further, as a specific implementation of FIG. 1, an embodiment of the present application provides an apparatus for processing speech noise. As shown in FIG. 3, the apparatus includes: an acquiring unit 31, a determining unit 32, and a noise reduction unit 33.

The acquiring unit 31 may be used to acquire a voice sequence to be recognized. The acquiring unit 31 is the main functional module of the device for acquiring the voice sequence to be recognized.

The determining unit 32 may be configured to perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained through joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different. The determining unit 32 is the main functional module in the device that performs noise recognition on the voice sequence and, if the voice sequence contains voice noise, uses the preset noise classification model to determine the noise category corresponding to the voice noise; it is also the core module.

The noise reduction unit 33 may be configured to determine an optimal noise reduction processing strategy corresponding to the speech noise based on the noise category, and use the optimal noise reduction processing strategy to perform noise reduction processing on the speech noise. The noise reduction unit 33 is the main functional module in the device that determines the optimal noise reduction processing strategy corresponding to the speech noise based on the noise category and uses that strategy to perform noise reduction processing on the speech noise.

Further, in order to determine the noise category corresponding to the speech noise, as shown in FIG. 4, the determining unit 32 includes an extraction module 321, a judgment module 322, and a determining module 323.

The extraction module 321 may be used to extract voice features of the voice sequence to obtain voice features corresponding to the voice sequence to be recognized.

The judgment module 322 may be used to judge whether the speech sequence contains speech noise based on the speech feature.

The determining module 323 may be configured to determine the noise category corresponding to the voice noise by using the noise classification model based on the extracted voice feature if the voice noise is included.
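A minimal sketch of how the three modules could be chained is given below. The chosen features (mean energy and zero-crossing rate), the zero-crossing threshold, and the nearest-centroid rule that stands in for the trained noise classification model are assumptions for illustration and are not taken from the application.

```python
import math
from typing import Dict, List, Optional, Tuple

Features = Tuple[float, float]

def extract_features(seq: List[float]) -> Features:
    """Extraction module 321 (sketch): mean energy and zero-crossing rate."""
    if len(seq) < 2:
        raise ValueError("need at least two samples")
    energy = sum(s * s for s in seq) / len(seq)
    crossings = sum(1 for a, b in zip(seq, seq[1:]) if a * b < 0)
    return energy, crossings / (len(seq) - 1)

def contains_noise(features: Features, zcr_threshold: float = 0.5) -> bool:
    """Judgment module 322 (sketch): assumed rule, high zero-crossing rate means noise."""
    return features[1] > zcr_threshold

def classify_noise(features: Features,
                   centroids: Dict[str, Features]) -> str:
    """Determining module 323 (sketch): nearest centroid replaces the trained model."""
    return min(centroids, key=lambda name: math.dist(features, centroids[name]))

def determine_category(seq: List[float],
                       centroids: Dict[str, Features]) -> Optional[str]:
    """Run extraction, judgment and determination in order; None means no noise."""
    features = extract_features(seq)
    if not contains_noise(features):
        return None
    return classify_noise(features, centroids)
```

The same extracted features feed both the judgment and the classification step, which mirrors the statement that the category is determined "based on the extracted voice feature".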

Further, in order to construct a preset noise classification model and multiple noise generation models, the device further includes a clustering unit 34 and a construction unit 35.

The acquiring unit 31 may also be used to acquire a real voice sequence and multiple random voice sequences in a preset voice sample library.

The clustering unit 34 may be used to perform clustering processing on the real voice sequence to obtain real voice sequences in different clustering categories.

The construction unit 35 may be configured to construct the noise classification model and the multiple noise generation models according to the multiple random voice sequences and the real voice sequences in the different clustering categories.

Further, in order to perform clustering processing on real speech sequences, the clustering unit 34 includes: a calculation module 341 and a clustering module 342.

The calculation module 341 may be used to calculate the Euclidean distance between different real speech sequences according to a preset Euclidean distance algorithm.

The clustering module 342 may be used to perform clustering processing on the real speech sequence based on the Euclidean distance to obtain real speech sequences in different clustering categories.
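The two modules can be illustrated with a small k-means routine, which is one common way of clustering on Euclidean distance. The use of k-means, the deterministic initialization from the first k points, and the fixed iteration count are assumptions of this sketch; the passage above only states that clustering is based on the Euclidean distance.

```python
import math
from typing import List

Vector = List[float]

def euclidean(a: Vector, b: Vector) -> float:
    """Calculation module 341 (sketch): Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points: List[Vector], k: int, iterations: int = 10) -> List[int]:
    """Clustering module 342 (sketch): assign each point to its nearest centroid."""
    centroids = [list(p) for p in points[:k]]  # assumed deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

For example, `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)` places the first two vectors in one clustering category and the last two in another. In practice each real speech sequence would first be converted to a fixed-length feature vector.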

Further, in order to construct a noise classification model and multiple noise generation models, the construction unit 35 includes: a first construction module 351 and a second construction module 352.

The first construction module 351 may be used to separately construct an initial noise classification model and multiple initial noise generation models.

The second construction module 352 may be used to perform joint iterative training on the initial noise classification model and the multiple initial noise generation models according to the multiple random voice sequences and the real voice sequences in the different clustering categories, so as to construct the noise classification model and the multiple noise generation models.

Further, the second construction module 352 includes: a generation sub-module, a recognition sub-module, a classification sub-module, and a construction sub-module.

The generation sub-module may be used to input the multiple random speech sequences into the multiple initial noise generation models, respectively, to generate different types of speech noise.

The recognition sub-module may be used to input the generated speech noise and the real speech sequence into the initial noise recognition model to perform noise recognition and obtain the initial noise recognition result.

The classification sub-module may be used to extract the speech features corresponding to the speech noise in the initial noise recognition result and input them into the initial noise classification model for noise classification, to obtain the initial noise classification result.

The construction sub-module may be used to construct a noise recognition accuracy loss function and a noise classification accuracy loss function based on the initial noise recognition result and the initial noise classification result.

The construction sub-module may also be used to perform joint iterative training on the initial noise recognition model, the initial noise classification model, and the multiple initial noise generation models according to the noise recognition accuracy loss function and the noise classification accuracy loss function, so as to construct a noise recognition model, the noise classification model, and the multiple noise generation models, respectively.
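The following sketch shows one way the two loss functions could be formed and combined into a single training objective. The use of cross-entropy for both losses, the `weight` parameter, and the helper names are assumptions for illustration; the passage above does not fix the form of either loss. The Gaussian random input follows the claims, which state that the random speech sequences obey a Gaussian distribution.

```python
import math
import random
from typing import List, Sequence

EPS = 1e-12  # keeps log() away from zero

def gaussian_sequence(length: int, rng: random.Random) -> List[float]:
    """Random input for a noise generation model, drawn from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

def recognition_loss(probs: Sequence[float], targets: Sequence[int]) -> float:
    """Assumed form: binary cross-entropy over noise / not-noise decisions."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def classification_loss(prob_rows: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> float:
    """Assumed form: categorical cross-entropy over noise categories."""
    total = 0.0
    for row, label in zip(prob_rows, labels):
        total -= math.log(max(row[label], EPS))
    return total / len(labels)

def joint_loss(recog_probs: Sequence[float], recog_targets: Sequence[int],
               class_probs: Sequence[Sequence[float]], class_labels: Sequence[int],
               weight: float = 1.0) -> float:
    """Combined objective minimized during joint iterative training (sketch)."""
    return (recognition_loss(recog_probs, recog_targets)
            + weight * classification_loss(class_probs, class_labels))
```

In a full implementation, the gradients of this combined objective would be propagated to the recognition model, the classification model, and every generation model in each iteration; that update step is omitted here.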

It should be noted that, for other corresponding descriptions of various functional modules involved in the apparatus for processing speech noise provided by the embodiment of the present application, reference may be made to the corresponding description of the method shown in FIG. 1, which is not repeated here.

Based on the above-mentioned method shown in FIG. 1, correspondingly, an embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile and has a computer program stored thereon. When the program is executed by a processor, the following steps are realized: acquiring the voice sequence to be recognized; performing noise recognition on the voice sequence, and if the voice sequence contains voice noise, using a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is jointly trained with multiple noise generation models, and the types of speech noise generated by different noise generation models are different; and based on the noise category, determining the optimal noise reduction processing strategy corresponding to the speech noise, and using the optimal noise reduction processing strategy to perform noise reduction processing on the speech noise.

Based on the above-mentioned method shown in FIG. 1 and the embodiment of the apparatus shown in FIG. 3, an embodiment of the present application also provides a physical structure diagram of a computer device. As shown in FIG. 5, the computer device includes: a processor 41, a memory 42, and a computer program that is stored on the memory 42 and can run on the processor 41, wherein the memory 42 and the processor 41 are both arranged on a bus 43, and the processor 41 implements the following steps when executing the program: acquiring the voice sequence to be recognized; performing noise recognition on the voice sequence, and if the voice sequence contains voice noise, using a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is jointly trained with multiple noise generation models, and the types of speech noise generated by different noise generation models are different; and based on the noise category, determining the optimal noise reduction processing strategy corresponding to the speech noise, and using the optimal noise reduction processing strategy to perform noise reduction processing on the speech noise.

Through the technical solution of the present application, the voice sequence to be recognized can be acquired; noise recognition is performed on the voice sequence, and if the voice sequence contains voice noise, a preset noise classification model is used to determine the noise category corresponding to the voice noise, wherein the noise classification model is jointly trained with multiple noise generation models, and the types of speech noise generated by different noise generation models are different; at the same time, based on the noise category, the optimal noise reduction processing strategy corresponding to the speech noise is determined and used to perform noise reduction processing on the speech noise. Because the noise classification model is jointly trained with the multiple noise generation models, the noise classification model in this application can identify the types of speech noise in different scenarios, so that the optimal noise reduction processing strategy can be selected according to the determined noise category and the optimal noise reduction processing effect can be achieved.

Claims

A method for processing speech noise, including:

Obtain the voice sequence to be recognized;

Perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained by joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different;

Based on the noise category, an optimal noise reduction processing strategy corresponding to the speech noise is determined, and the optimal noise reduction processing strategy is used to perform noise reduction processing on the speech noise.
The method according to claim 1, wherein if the speech sequence contains speech noise, determining the noise category corresponding to the speech noise by using a preset noise classification model comprises:

Performing voice feature extraction on the voice sequence to obtain voice features corresponding to the voice sequence;

Based on the voice feature, determine whether the voice sequence contains voice noise;

If voice noise is included, the noise classification model is used to determine the noise category corresponding to the voice noise based on the extracted voice features.
The method according to claim 1, wherein, before the obtaining the speech sequence to be recognized, the method further comprises:

Obtain the real voice sequence and multiple random voice sequences in the preset voice sample library;

Performing clustering processing on the real voice sequence to obtain real voice sequences in different clustering categories;

The noise classification model and the multiple noise generation models are constructed according to the multiple random voice sequences and the real voice sequences in the different clustering categories.
The method according to claim 3, wherein said performing clustering processing on said real speech sequences to obtain real speech sequences in different clustering categories comprises:

Calculate the Euclidean distance between different real speech sequences according to the preset Euclidean distance algorithm;

Based on the Euclidean distance, clustering is performed on the real speech sequence to obtain real speech sequences in different clustering categories.
The method according to claim 3, wherein said constructing the noise classification model and the multiple noise generation models according to the multiple random voice sequences and the real voice sequences in the different clustering categories comprises:

Build an initial noise classification model and multiple initial noise generation models respectively;

According to the multiple random voice sequences and the real voice sequences in the different clustering categories, the initial noise classification model and the multiple initial noise generation models are jointly iteratively trained to construct the noise classification model and the multiple noise generation models.
The method according to claim 5, wherein said separately constructing an initial noise classification model and a plurality of initial noise generation models comprises:

Build the initial noise recognition model, initial noise classification model and multiple initial noise generation models respectively;

Said performing joint iterative training on the initial noise classification model and the multiple initial noise generation models according to the multiple random speech sequences and the real speech sequences in the different clustering categories to construct the noise classification model and the multiple noise generation models comprises:

Input the plurality of random speech sequences into the plurality of initial noise generation models respectively to generate different types of speech noise;

Respectively input the generated speech noise and the real speech sequence into the initial noise recognition model for noise recognition, and obtain an initial noise recognition result;

Extracting the speech features corresponding to the speech noise in the initial noise recognition result, and inputting it into the initial noise classification model for noise classification, to obtain an initial noise classification result;

Based on the initial noise recognition result and the initial noise classification result, respectively constructing a noise recognition accuracy loss function and a noise classification accuracy loss function;

According to the noise recognition accuracy loss function and the noise classification accuracy loss function, the initial noise recognition model, the initial noise classification model, and the multiple initial noise generation models are jointly and iteratively trained to construct the noise recognition model, the noise classification model, and the multiple noise generation models respectively.
The method according to any one of claims 3-6, wherein the plurality of random speech sequences obey a Gaussian distribution.
A processing device for speech noise, including:

The acquiring unit is used to acquire the voice sequence to be recognized;

The determining unit is configured to perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained by joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different;

The noise reduction unit is configured to determine an optimal noise reduction processing strategy corresponding to the speech noise based on the noise category, and use the optimal noise reduction processing strategy to perform noise reduction processing on the speech noise.
A computer-readable storage medium with a computer program stored thereon, and when the computer program is executed by a processor, the steps of a method for processing speech noise are realized:

Obtain the voice sequence to be recognized;

Perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained by joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different;

Based on the noise category, an optimal noise reduction processing strategy corresponding to the speech noise is determined, and the optimal noise reduction processing strategy is used to perform noise reduction processing on the speech noise.
The computer-readable storage medium according to claim 9, wherein if the speech sequence contains speech noise, using a preset noise classification model to determine the noise category corresponding to the speech noise comprises:

Performing voice feature extraction on the voice sequence to obtain voice features corresponding to the voice sequence;

Based on the voice feature, determine whether the voice sequence contains voice noise;

If voice noise is included, the noise classification model is used to determine the noise category corresponding to the voice noise based on the extracted voice features.
The computer-readable storage medium according to claim 9, wherein, before said obtaining the speech sequence to be recognized, the method further comprises:

Obtain the real voice sequence and multiple random voice sequences in the preset voice sample library;

Performing clustering processing on the real voice sequence to obtain real voice sequences in different clustering categories;

The noise classification model and the multiple noise generation models are constructed according to the multiple random voice sequences and the real voice sequences in the different clustering categories.
The computer-readable storage medium according to claim 11, wherein said performing clustering processing on said real speech sequences to obtain real speech sequences in different clustering categories comprises:

The Euclidean distance between different real speech sequences is calculated according to a preset Euclidean distance algorithm; based on the Euclidean distance, the real speech sequence is clustered to obtain real speech sequences in different clustering categories.
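The computer-readable storage medium according to claim 11, wherein said constructing the noise classification model and the multiple noise generation models according to the multiple random voice sequences and the real voice sequences in the different clustering categories comprises: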

Build an initial noise classification model and multiple initial noise generation models respectively;

According to the multiple random voice sequences and the real voice sequences in the different clustering categories, the initial noise classification model and the multiple initial noise generation models are jointly iteratively trained to construct the noise classification model and the multiple noise generation models.
The computer-readable storage medium according to claim 13, wherein said separately constructing an initial noise classification model and a plurality of initial noise generation models comprises:

Build the initial noise recognition model, initial noise classification model and multiple initial noise generation models respectively;

Said performing joint iterative training on the initial noise classification model and the multiple initial noise generation models according to the multiple random speech sequences and the real speech sequences in the different clustering categories to construct the noise classification model and the multiple noise generation models comprises:

Input the plurality of random speech sequences into the plurality of initial noise generation models respectively to generate different types of speech noise;

Respectively input the generated speech noise and the real speech sequence into the initial noise recognition model for noise recognition, and obtain an initial noise recognition result;

Extracting the speech features corresponding to the speech noise in the initial noise recognition result, and inputting it into the initial noise classification model for noise classification, to obtain an initial noise classification result;

Based on the initial noise recognition result and the initial noise classification result, respectively constructing a noise recognition accuracy loss function and a noise classification accuracy loss function;

According to the noise recognition accuracy loss function and the noise classification accuracy loss function, the initial noise recognition model, the initial noise classification model, and the multiple initial noise generation models are jointly and iteratively trained to construct the noise recognition model, the noise classification model, and the multiple noise generation models respectively.
The computer-readable storage medium according to any one of claims 11-14, wherein the plurality of random speech sequences obey a Gaussian distribution.
A computer device includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor, and when the computer program is executed by the processor, the steps of a method for processing speech noise are realized:

Obtain the voice sequence to be recognized;

Perform noise recognition on the voice sequence, and if the voice sequence contains voice noise, use a preset noise classification model to determine the noise category corresponding to the voice noise, wherein the noise classification model is obtained by joint training with multiple noise generation models, and the types of speech noise generated by different noise generation models are different;

Based on the noise category, an optimal noise reduction processing strategy corresponding to the speech noise is determined, and the optimal noise reduction processing strategy is used to perform noise reduction processing on the speech noise.
The computer device according to claim 16, wherein if the speech sequence contains speech noise, using a preset noise classification model to determine the noise category corresponding to the speech noise comprises:

Performing voice feature extraction on the voice sequence to obtain voice features corresponding to the voice sequence;

Based on the voice feature, determine whether the voice sequence contains voice noise;

If voice noise is included, the noise classification model is used to determine the noise category corresponding to the voice noise based on the extracted voice features.
The computer device according to claim 16, wherein, before said obtaining the speech sequence to be recognized, the method further comprises:

Obtain the real voice sequence and multiple random voice sequences in the preset voice sample library;

Performing clustering processing on the real voice sequence to obtain real voice sequences in different clustering categories;

The noise classification model and the multiple noise generation models are constructed according to the multiple random voice sequences and the real voice sequences in the different clustering categories.
The computer device according to claim 18, wherein said performing clustering processing on said real speech sequences to obtain real speech sequences in different clustering categories comprises: calculating the Euclidean distance between different real speech sequences according to a preset Euclidean distance algorithm; and, based on the Euclidean distance, clustering the real speech sequences to obtain real speech sequences in different clustering categories.
The computer device according to claim 18, wherein said constructing the noise classification model and the multiple noise generation models according to the multiple random voice sequences and the real voice sequences in the different clustering categories comprises:

Build an initial noise classification model and multiple initial noise generation models respectively;

According to the multiple random voice sequences and the real voice sequences in the different clustering categories, the initial noise classification model and the multiple initial noise generation models are jointly iteratively trained to construct the noise classification model and the multiple noise generation models.