CN114299932A - Voice data processing method and device, computer equipment and storage medium - Google Patents

Voice data processing method and device, computer equipment and storage medium Download PDF

Info

Publication number
CN114299932A
Authority
CN
China
Prior art keywords
noisy
voice
data
pure
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111605751.2A
Other languages
Chinese (zh)
Inventor
马夺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Pudu Technology Co Ltd
Original Assignee
Shenzhen Pudu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Pudu Technology Co Ltd
Priority to CN202111605751.2A
Publication of CN114299932A
Legal status: Withdrawn

Abstract

The present application relates to a voice data processing method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: acquiring clean speech sample data and noisy speech sample data of a target domain; preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data; training a generative adversarial network with the noisy image and the clean image to obtain a domain conversion model; and inputting clean speech data into the domain conversion model to obtain noisy speech data of the target domain. With this method, a large amount of noisy speech data of the target domain can be obtained, the generated noisy speech data better fits the actual application scene of the target domain, and a speech recognition model trained on the noisy speech data of the target domain can recognize speech data in the actual scene more accurately.

Description

Voice data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method and an apparatus for processing speech data, a computer device, and a storage medium.
Background
With the rapid development of deep learning and the ease with which speech information can be processed and transmitted, voice interaction is increasingly used in people's daily lives. In voice interaction, speech recognition is the entry point of human-machine interaction: it enables a machine to convert speech information into corresponding text or commands through recognition and understanding. The quality of speech recognition directly determines the voice interaction experience, so speech recognition is the key technology of voice interaction.
In speech recognition, speech information is converted into text by a speech recognition model. Because speech is affected by the acoustic environment in which it is produced, the speech recognition model is usually trained with speech data containing noise in order to improve its recognition accuracy. When training a speech recognition model for a specific domain, how well the training data matches the corresponding scene is therefore important. Speech data for a specific scene constructed with conventional techniques is not sufficient to reflect the real scene.
Disclosure of Invention
In view of the above, it is necessary to provide a voice data processing method, apparatus, computer device, computer-readable storage medium, and computer program product that can obtain speech data that better matches a real scene.
In a first aspect, the present application provides a method for processing voice data. The method comprises the following steps:
acquiring clean speech sample data and noisy speech sample data of a target domain;
preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data;
training a generative adversarial network with the noisy image and the clean image to obtain a domain conversion model; and
inputting clean speech data into the domain conversion model to obtain noisy speech data of the target domain.
In one embodiment, preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data includes:
performing frequency-domain conversion on the noisy speech sample data and the clean speech sample data respectively to obtain noisy speech frequency-domain data corresponding to the noisy speech sample data and clean speech frequency-domain data corresponding to the clean speech sample data;
processing the clean speech frequency-domain data and the noisy speech frequency-domain data respectively to obtain a clean speech magnitude spectrum corresponding to the clean speech frequency-domain data and a noisy speech magnitude spectrum corresponding to the noisy speech frequency-domain data; and
obtaining the noisy image from the noisy speech magnitude spectrum, and obtaining the clean image from the clean speech magnitude spectrum.
In one embodiment, training the generative adversarial network with the noisy image and the clean image to obtain the domain conversion model includes:
determining a discriminator loss, a generator loss and a feature region loss from the noisy image and the clean image;
determining a comprehensive loss of the generative adversarial network from the discriminator loss, the generator loss and the feature region loss; and
training the generative adversarial network with the comprehensive loss, and obtaining the domain conversion model when a convergence condition is met.
In one embodiment, determining the feature region loss from the noisy image and the clean image includes:
selecting feature regions of a preset size on the noisy image to obtain a plurality of noisy feature regions corresponding to the noisy image, and correspondingly selecting feature regions of the preset size on the clean image to obtain a plurality of clean feature regions corresponding to the clean image;
inputting the noisy feature regions and the clean feature regions into the generator network of the generative adversarial network to obtain generated noisy features corresponding to the noisy feature regions and generated clean features corresponding to the clean feature regions; and
determining the feature region loss from the generated noisy features and the generated clean features.
In one embodiment, determining the feature region loss from the generated noisy features and the generated clean features includes:
taking any one generated noisy feature as a reference value, and comparing each generated clean feature with the reference value for similarity to obtain a sub-feature-region loss corresponding to that generated noisy feature; and
obtaining the feature region loss from the sub-feature-region losses corresponding to all generated noisy features.
In one embodiment, taking any one generated noisy feature as a reference value and comparing each generated clean feature with the reference value for similarity to obtain the sub-feature-region loss corresponding to that generated noisy feature includes:
taking any one generated noisy feature as the reference value, and comparing each generated clean feature with the reference value for similarity;
when the position of a generated clean feature corresponds to that of the generated noisy feature, comparing the similarity of the generated noisy feature and that generated clean feature to obtain a positive feature region loss;
when the position of a generated clean feature does not correspond to that of the generated noisy feature, comparing the similarity of the generated noisy feature and that generated clean feature to obtain a negative feature region loss; and
obtaining the sub-feature-region loss from the positive feature region loss and the negative feature region losses.
In a second aspect, the present application further provides a voice data processing apparatus. The apparatus comprises:
a sample acquisition module, configured to acquire clean speech sample data and noisy speech sample data of a target domain;
a preprocessing module, configured to preprocess the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data;
a model training module, configured to train a generative adversarial network with the noisy image and the clean image to obtain a domain conversion model; and
a data generation module, configured to input clean speech data into the domain conversion model to obtain noisy speech data of the target domain.
In a third aspect, the present application further provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the following steps:
acquiring clean speech sample data and noisy speech sample data of a target domain;
preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data;
training a generative adversarial network with the noisy image and the clean image to obtain a domain conversion model; and
inputting clean speech data into the domain conversion model to obtain noisy speech data of the target domain.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the following steps:
acquiring clean speech sample data and noisy speech sample data of a target domain;
preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data;
training a generative adversarial network with the noisy image and the clean image to obtain a domain conversion model; and
inputting clean speech data into the domain conversion model to obtain noisy speech data of the target domain.
In a fifth aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the following steps:
acquiring clean speech sample data and noisy speech sample data of a target domain;
preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data;
training a generative adversarial network with the noisy image and the clean image to obtain a domain conversion model; and
inputting clean speech data into the domain conversion model to obtain noisy speech data of the target domain.
According to the above voice data processing method, apparatus, computer device, storage medium and computer program product, clean speech sample data and noisy speech sample data of a target domain are acquired; the noisy speech sample data and the clean speech sample data are preprocessed respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data; a generative adversarial network is trained with the noisy image and the clean image to obtain a domain conversion model; and clean speech data is input into the domain conversion model to obtain noisy speech data of the target domain. By training the generative adversarial network on images obtained by preprocessing a small amount of noisy speech sample data and clean speech sample data of the target domain, and then feeding clean speech data to the resulting domain conversion model, a large amount of noisy speech data of the target domain can be obtained. The generated noisy speech data better fits the actual application scene of the target domain, and a speech recognition model trained on the noisy speech data of the target domain can recognize speech data in the actual scene more accurately.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a method for processing speech data;
FIG. 2 is a flow diagram illustrating a method for processing speech data in one embodiment;
FIG. 3 is a flow chart illustrating step 204 in one embodiment;
FIG. 4 is a flow chart illustrating step 206 in one embodiment;
FIG. 5 is a flow diagram illustrating a process for determining loss of a feature region in one embodiment;
FIG. 6 is a flow chart illustrating step 506 in one embodiment;
FIG. 7 is a flowchart illustrating step 602 in one embodiment;
FIG. 8 is a flow diagram illustrating a process for determining a loss of a sub-feature region in one embodiment;
FIG. 9 is a flowchart illustrating a method of processing voice data according to another embodiment;
FIG. 10 is a block diagram showing the structure of a speech data processing apparatus according to an embodiment;
FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice data processing method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 over a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on the cloud or another network server. The server 104 acquires clean speech sample data and noisy speech sample data of a target domain; preprocesses the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data; trains a generative adversarial network with the noisy image and the clean image to obtain a domain conversion model; and inputs clean speech data into the domain conversion model to obtain noisy speech data of the target domain. The server 104 may transmit the resulting noisy speech data of the target domain to the terminal 102. The terminal 102 may be, but is not limited to, a robot, a personal computer, a notebook computer, a smartphone, a tablet computer, an Internet of Things device, or a portable wearable device; the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart in-vehicle device, or the like, and the portable wearable device may be a smart watch, smart bracelet, head-mounted device, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In practical scenarios, the speech data received by a speech recognition model is a mixture of the target speech and other noise, such as background voices and background music, which degrades recognition. A relatively robust speech recognition system is usually trained with a large amount of labeled data, but collecting such data is costly in manpower and material resources, and for application scenarios in a specific domain, such as catering, voice interaction data for a robot is difficult to obtain. The conventional approach to speech data augmentation adds noise to clean speech data at a given signal-to-noise ratio; this is not flexible enough to handle different application scenarios, and the simulated speech data is difficult to keep spectrally consistent with data from the real scene, so it is insufficient to reflect the real scene.
In one embodiment, as shown in FIG. 2, a voice data processing method is provided. Taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:
Step 202: acquire clean speech sample data and noisy speech sample data of the target domain.
The server acquires clean speech sample data sent by the terminal and noisy speech sample data of the target domain. Clean speech sample data refers to a speech signal without noise or with a noise proportion below a preset threshold. Noisy speech sample data of the target domain refers to a speech signal in which clean speech is mixed with noise that interferes with the clean speech in a specific domain. Optionally, the noise includes, but is not limited to, interference noise, which may be external noise or noise transmitted through the room enclosure; background noise, which is the noise around a listener, such as general indoor noise or traffic noise entering the room from outside, and which is generally difficult to avoid; and transmission noise, which refers to the effect of various peripheral interferences on the data during transmission, such as noise arising in digital and analog circuits. It should be noted that, in this embodiment, the noisy speech sample data of the target domain does not need to be labeled with corresponding text, and the amounts of clean speech sample data and noisy speech sample data may each be as little as 30 minutes to a few hours of audio.
Step 204: preprocess the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data.
The noisy speech sample data is preprocessed to obtain a noisy image corresponding to the noisy speech sample data, and the clean speech sample data is preprocessed to obtain a clean image corresponding to the clean speech sample data. In general, a speech signal is sampled at a set sampling period and the resulting signal is first handled in the time domain; the time-domain speech signal then needs to be converted into the frequency domain so that the magnitude spectrum of the speech signal can be separated out, and an image corresponding to the speech signal is obtained from that magnitude spectrum. That is, a noisy speech magnitude spectrum is obtained from the noisy speech sample data, and the noisy image is obtained from the noisy speech magnitude spectrum; similarly, a clean speech magnitude spectrum is obtained from the clean speech sample data, and the clean image is obtained from the clean speech magnitude spectrum.
Optionally, a noisy speech magnitude spectrum and a noisy speech phase spectrum may both be obtained from the noisy speech sample data, and the noisy image may be obtained from the noisy speech magnitude spectrum together with the noisy speech phase spectrum; similarly, a clean speech magnitude spectrum and a clean speech phase spectrum may be obtained from the clean speech sample data, and the clean image may be obtained from both. Preprocessing of the speech data further includes pre-emphasis, framing and windowing; window functions include the Hamming window, the rectangular window and the Hanning window, and the specific preprocessing method can be chosen according to the actual processing requirements.
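As a non-limiting illustration, a minimal Python sketch of such preprocessing (pre-emphasis, framing and Hamming windowing) might look as follows; the sampling rate, frame length, hop length and pre-emphasis coefficient are common defaults rather than values specified in this application:

```python
import numpy as np

def preprocess_frames(signal: np.ndarray, sr: int = 16000,
                      frame_ms: float = 25.0, hop_ms: float = 10.0,
                      pre_emphasis: float = 0.97) -> np.ndarray:
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)       # 160 samples at 16 kHz

    # Zero-pad so the signal fills a whole number of frames
    n_frames = 1 + int(np.ceil(max(len(emphasized) - frame_len, 0) / hop_len))
    pad = (n_frames - 1) * hop_len + frame_len - len(emphasized)
    emphasized = np.append(emphasized, np.zeros(pad))

    # Split into overlapping frames and apply a Hamming window to each frame
    window = np.hamming(frame_len)
    return np.stack([emphasized[i * hop_len:i * hop_len + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)
```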
Step 206: train the generative adversarial network with the noisy image and the clean image to obtain a domain conversion model.
A generative adversarial network (GAN) uses adversarial training so that the samples it generates follow the real data distribution. In a generative adversarial network, two networks are trained against each other. One is the discriminator network, whose goal is to determine as accurately as possible whether a sample comes from the real data or was produced by the generator; the other is the generator network, whose goal is to produce samples whose origin the discriminator cannot determine. The two networks, with opposing goals, are trained alternately. At convergence, if the discriminator can no longer determine the origin of a sample, the generator is able to produce samples that follow the real data distribution.
In this embodiment, the clean speech sample data and the noisy speech sample data of the target domain each comprise a plurality of pieces of speech data, and after the noisy speech sample data and the clean speech sample data are preprocessed respectively, each piece of speech data corresponds to one image: one piece of noisy speech sample data is preprocessed into one noisy image, and one piece of clean speech sample data is preprocessed into one clean image. In each training iteration, an image pair formed by one noisy image and one clean image is input into the generative adversarial network, and the domain conversion model is obtained after training. The domain conversion model is a model that can generate noisy speech data of the target domain.
Step 208: input clean speech data into the domain conversion model to obtain noisy speech data of the target domain.
After the domain conversion model is obtained, clean speech data is input into it to obtain noisy speech data of the target domain. The clean speech data may be collected independently in a quiet environment or may be open-source clean speech data, for example selected from the AISHELL-ASR0009-OS1 open-source Chinese speech database, and it may belong to a domain other than the target domain.
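As a non-limiting illustration, assuming a trained PyTorch generator that operates on magnitude-spectrogram "images", the domain conversion step might be sketched as follows; the STFT parameters are placeholders, and since this application does not specify how a waveform is recovered from the generated magnitude spectrum, Griffin-Lim is shown purely as one possibility:

```python
import librosa
import numpy as np
import torch

def clean_to_noisy(wav_path: str, generator: torch.nn.Module,
                   n_fft: int = 512, hop_length: int = 128) -> np.ndarray:
    # Load clean speech and compute its magnitude spectrogram
    y, _ = librosa.load(wav_path, sr=16000)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

    # Treat the magnitude spectrogram as a single-channel image: (1, 1, F, T)
    x = torch.from_numpy(mag).float().unsqueeze(0).unsqueeze(0)
    with torch.no_grad():
        noisy_mag = generator(x).squeeze(0).squeeze(0).numpy()

    # Clamp to non-negative values and recover a waveform; Griffin-Lim is only
    # one possible reconstruction, not a method specified in this application
    noisy_mag = np.clip(noisy_mag, 0.0, None)
    return librosa.griffinlim(noisy_mag, n_fft=n_fft, hop_length=hop_length)
```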
In the above voice data processing method, clean speech sample data and noisy speech sample data of a target domain are acquired; the noisy speech sample data and the clean speech sample data are preprocessed respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data; a generative adversarial network is trained with the noisy image and the clean image to obtain a domain conversion model; and clean speech data is input into the domain conversion model to obtain noisy speech data of the target domain. By training the generative adversarial network on images obtained by preprocessing a small amount of noisy and clean speech sample data of the target domain and then feeding clean speech data to the resulting domain conversion model, a large amount of noisy speech data of the target domain can be obtained; the generated noisy speech data better fits the actual application scene of the target domain, and a speech recognition model trained on it can recognize speech data in the actual scene more accurately.
In one embodiment, as shown in FIG. 3, step 204 of preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data includes:
Step 302: perform frequency-domain conversion on the noisy speech sample data and the clean speech sample data respectively to obtain noisy speech frequency-domain data corresponding to the noisy speech sample data and clean speech frequency-domain data corresponding to the clean speech sample data.
In this embodiment, the noisy speech sample data and the clean speech sample data are both time-domain signals, so they need to be converted into frequency-domain signals for further processing. The time domain describes a mathematical function or physical signal as a function of time; the frequency domain describes the relationship between the signal and frequency. The conversion from the time domain to the frequency domain can be achieved with a Fourier series or a Fourier transform. However, the Fourier transform is a global transform, entirely in the time domain or entirely in the frequency domain, and cannot express the local time-frequency properties of a signal, so a short-time Fourier transform can be used instead. Its basic idea is: assume that the non-stationary signal f(t) is stationary within the short interval covered by an analysis window w(t); by shifting the analysis window so that f(t)w(t − τ) is stationary over different finite time periods, the power spectrum of the non-stationary signal at each instant can be calculated.
Step 304: process the clean speech frequency-domain data and the noisy speech frequency-domain data respectively to obtain a clean speech magnitude spectrum corresponding to the clean speech frequency-domain data and a noisy speech magnitude spectrum corresponding to the noisy speech frequency-domain data.
In one possible example, a short-time Fourier transform is applied to the noisy speech sample data to obtain a corresponding complex spectrum, which is the noisy speech frequency-domain data; taking the modulus of this complex spectrum yields the corresponding noisy speech magnitude spectrum. Similarly, a short-time Fourier transform is applied to the clean speech sample data to obtain the corresponding clean speech frequency-domain data, and taking its modulus yields the corresponding clean speech magnitude spectrum.
Step 306: obtain the noisy image from the noisy speech magnitude spectrum, and obtain the clean image from the clean speech magnitude spectrum.
In this embodiment, a magnitude spectrum is equivalent to a grayscale image, and the aim is to obtain, through processing of that grayscale image, a color image in a corresponding color mode. The noisy speech magnitude spectrum can be saved directly as an image and used as the noisy image; similarly, the clean speech magnitude spectrum can be saved directly as an image and used as the clean image. The noisy image and the clean image may be images in a color mode such as RGB, HSB or LAB.
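As a non-limiting illustration, one way to save a magnitude spectrum as an image might be sketched as follows; the log compression and min-max scaling are assumptions, since the embodiment only states that the magnitude spectrum can be saved directly as an image:

```python
import numpy as np
import librosa
from PIL import Image

def magnitude_to_image(wav_path: str, png_path: str,
                       n_fft: int = 512, hop_length: int = 128) -> None:
    y, _ = librosa.load(wav_path, sr=16000)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

    # Log-compress and min-max scale to [0, 255] so the magnitude spectrum
    # behaves like a grayscale image, then save it as an RGB image file
    log_mag = np.log1p(mag)
    scaled = 255.0 * (log_mag - log_mag.min()) / (log_mag.max() - log_mag.min() + 1e-8)
    Image.fromarray(scaled.astype(np.uint8)).convert("RGB").save(png_path)
```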
In one embodiment, as shown in FIG. 4, step 206 of training the generative adversarial network with the noisy image and the clean image to obtain the domain conversion model includes:
Step 402: determine a discriminator loss, a generator loss and a feature region loss from the noisy image and the clean image.
As described above for the generative adversarial network, the goal of the discriminator is to distinguish whether a sample comes from the real distribution or from the generator, so the discriminator is in fact a binary classifier. In an optional embodiment, the clean image is input into the generator to obtain a generated image, and a first discrimination loss is obtained from the discriminator's result on the generated image; a second discrimination loss is obtained from the discriminator's result on the noisy image; and the discriminator loss is obtained from the first discrimination loss and the second discrimination loss.
Optionally, the clean image is input into the generator to obtain a generated image, and the generated image is input into the discriminator to obtain a first discrimination result pred_fake. The first discrimination loss loss_D_fake corresponding to this result is the square of the absolute value of pred_fake minus 0, i.e. loss_D_fake = |pred_fake − 0.0|². The noisy image is input into the discriminator to obtain a second discrimination result pred_real, and the second discrimination loss loss_D_real is the square of the absolute value of pred_real minus 1, i.e. loss_D_real = |pred_real − 1.0|². The discriminator loss is then the sum of the first and second discrimination losses, i.e. loss_D = loss_D_fake + loss_D_real.
The goal of the generator is exactly opposite to that of the discriminator, namely to have the discriminator judge its generated samples as real. In an optional embodiment, the clean image is input into the generator to obtain a generated image, the discriminator evaluates the generated image to obtain the first discrimination result, and a first generation loss is obtained from it: for example, the first generation loss loss_G_gan is the square of the absolute value of pred_fake minus 1, i.e. loss_G_gan = |pred_fake − 1.0|². The generator loss is then obtained from the first generation loss; it may be the first generation loss itself, or a formula variant in which the first generation loss participates in the calculation.
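As a non-limiting illustration, the above least-squares-style losses might be computed in PyTorch as follows; the names D, G, clean_img and noisy_img are placeholders:

```python
import torch

def gan_losses(D: torch.nn.Module, G: torch.nn.Module,
               clean_img: torch.Tensor, noisy_img: torch.Tensor):
    fake_noisy = G(clean_img)                        # generated "noisy" image

    # Discriminator: push fake toward 0, real toward 1
    pred_fake = D(fake_noisy.detach())
    pred_real = D(noisy_img)
    loss_D_fake = torch.mean((pred_fake - 0.0) ** 2)   # |pred_fake - 0|^2
    loss_D_real = torch.mean((pred_real - 1.0) ** 2)   # |pred_real - 1|^2
    loss_D = loss_D_fake + loss_D_real

    # Generator: make the discriminator score the fake image as real (1)
    loss_G_gan = torch.mean((D(fake_noisy) - 1.0) ** 2)
    return loss_D, loss_G_gan
```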
Step 404: determine the comprehensive loss of the generative adversarial network from the discriminator loss, the generator loss and the feature region loss.
The comprehensive loss of the generative adversarial network may be a simple sum of the discriminator loss, the generator loss and the feature region loss, or a weighted combination of the three.
Step 406: train the generative adversarial network with the comprehensive loss, and obtain the domain conversion model when a convergence condition is met.
The server trains the generative adversarial network with the comprehensive loss, and when the model convergence condition is met, the resulting generator is the domain conversion model. The convergence condition may be a preset number of training iterations of the generative adversarial network, or at least one of the discriminator loss, the generator loss or the feature region loss satisfying a preset condition.
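As a non-limiting illustration, one alternating training iteration combining the above losses might be sketched as follows; the weighting coefficients and the feature_region_loss_fn placeholder are assumptions, since the embodiment only states that the losses may be summed or weighted:

```python
import torch

def train_step(D, G, opt_D, opt_G, clean_img, noisy_img,
               feature_region_loss_fn, lambda_gan=1.0, lambda_feat=1.0):
    # --- discriminator update: fake images toward 0, real noisy images toward 1 ---
    fake_noisy = G(clean_img)
    loss_D = torch.mean(D(fake_noisy.detach()) ** 2) \
        + torch.mean((D(noisy_img) - 1.0) ** 2)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # --- generator update: adversarial term plus feature region term ---
    loss_G_gan = torch.mean((D(fake_noisy) - 1.0) ** 2)
    loss_feat = feature_region_loss_fn(G, clean_img, noisy_img)
    loss_G = lambda_gan * loss_G_gan + lambda_feat * loss_feat
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```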
In one embodiment, as shown in FIG. 5, determining the feature region loss from the noisy image and the clean image includes:
Step 502: select feature regions of a preset size on the noisy image to obtain a plurality of noisy feature regions corresponding to the noisy image, and correspondingly select feature regions of the preset size on the clean image to obtain a plurality of clean feature regions corresponding to the clean image.
In this embodiment, a preset number of feature regions of a preset size may be selected on the noisy image to obtain a plurality of noisy feature regions corresponding to the noisy image; similarly, a preset number of feature regions of the preset size may be selected on the clean image to obtain a plurality of clean feature regions corresponding to the clean image. It should be noted that the number of noisy feature regions equals the number of clean feature regions and their positions in the corresponding images correspond one to one: each feature region has a position code in its image, and if a noisy feature region and a clean feature region are at corresponding positions, their position codes are the same.
Step 504: input the noisy feature regions and the clean feature regions into the generator network of the generative adversarial network to obtain generated noisy features corresponding to the noisy feature regions and generated clean features corresponding to the clean feature regions.
In this embodiment, a noisy feature region is input into the generator network of the generative adversarial network to obtain a corresponding generated noisy feature; similarly, a clean feature region is input into the generator network to obtain a corresponding generated clean feature.
Step 506: determine the feature region loss from the generated noisy features and the generated clean features.
The feature region loss is determined from the generated noisy features and the generated clean features. Optionally, a generated noisy feature or a generated clean feature can be regarded as a matrix, and the feature region loss is obtained through operations between the corresponding matrices, for example multiplication of a generated noisy feature with a generated clean feature. It should be noted that, in each layer of the generator network, corresponding noisy feature regions and clean feature regions are selected, generated noisy features and generated clean features are obtained accordingly, and a corresponding feature region loss is obtained.
In one embodiment, as shown in FIG. 6, step 506 of determining the feature region loss from the generated noisy features and the generated clean features includes:
Step 602: take any one generated noisy feature as a reference value, and compare each generated clean feature with the reference value for similarity to obtain the sub-feature-region loss corresponding to that generated noisy feature.
Step 604: obtain the feature region loss from the sub-feature-region losses corresponding to all generated noisy features.
This is described in terms of the noisy image and clean image currently input to the generative adversarial network. The generated noisy features are selected in turn as the reference value; each generated clean feature is compared with the reference value for similarity, and the result of the comparison can be used as the sub-feature-region loss corresponding to that generated noisy feature, until the sub-feature-region losses corresponding to all generated noisy features are obtained. The feature region loss for this pass is then obtained from all the sub-feature-region losses. When the input noisy image and clean image change, the process is repeated. It should be noted that, during actual training, the result of a sub-feature-region loss may be fed back to the generative adversarial network as soon as it is obtained, or the results of all sub-feature-region losses may be fed back once they are all obtained.
In one embodiment, as shown in FIG. 7, step 602 of comparing each generated clean feature with the reference value for similarity to obtain the sub-feature-region loss corresponding to the generated noisy feature includes:
Step 702: take any one generated noisy feature as the reference value, and compare each generated clean feature with the reference value for similarity.
Step 704: when the position of a generated clean feature corresponds to that of the generated noisy feature, compare the similarity of the generated noisy feature and that generated clean feature to obtain a positive feature region loss.
When the position code of a generated clean feature is identified as matching that of the generated noisy feature, the generated clean feature corresponds in position to the generated noisy feature, and comparing the similarity of the generated noisy feature with that generated clean feature yields the positive feature region loss. For a selected generated noisy feature reference value, only one generated clean feature is at the corresponding position, so there is only one positive feature region loss.
Step 706: when the position of a generated clean feature does not correspond to that of the generated noisy feature, compare the similarity of the generated noisy feature and that generated clean feature to obtain a negative feature region loss.
When the position code of a generated clean feature is identified as not matching that of the generated noisy feature, the generated clean feature does not correspond in position to the generated noisy feature, and comparing the similarity of the generated noisy feature with that generated clean feature yields a negative feature region loss. Since only one generated clean feature shares the position of the selected generated noisy feature reference value, all other generated clean features are at different positions, giving a preset number minus one negative feature region losses.
Step 708: obtain the sub-feature-region loss from the positive feature region loss and the negative feature region losses.
In this embodiment, the positive feature region loss drives the generative adversarial network to learn in the forward direction, while the negative feature region loss drives it in the opposite direction; thus, during model training, the more similar the positively matched features are learned to be, and the more different the negatively matched features, the better the training precision of the generative adversarial network. Further, as shown in FIG. 8, if the position code of the selected generated noisy feature is a, then when the position code of a generated clean feature is also a, the similarity comparison between the generated noisy feature and that generated clean feature gives the positive feature region loss; otherwise it gives a negative feature region loss.
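As a non-limiting illustration, the positive/negative comparison over same-position and different-position feature regions can be written as a contrastive (InfoNCE-style) loss; the cosine similarity and temperature value below are assumptions, since the embodiment only specifies similarity comparisons against positive and negative regions:

```python
import torch
import torch.nn.functional as F

def feature_region_loss(noisy_feats: torch.Tensor,
                        clean_feats: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    # noisy_feats, clean_feats: (num_regions, feat_dim); row i of each tensor
    # comes from the same position code, so the diagonal pairs are positives.
    noisy_feats = F.normalize(noisy_feats, dim=1)
    clean_feats = F.normalize(clean_feats, dim=1)

    # Pairwise cosine similarities between every noisy and clean feature region
    logits = noisy_feats @ clean_feats.t() / temperature
    targets = torch.arange(noisy_feats.size(0), device=logits.device)

    # Cross-entropy pulls each noisy region toward its same-position clean region
    # (positive) and pushes it away from all other positions (negatives).
    return F.cross_entropy(logits, targets)
```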
In one embodiment, as shown in FIG. 9, a voice data processing method includes:
Step 902: acquire clean speech sample data and noisy speech sample data of the target domain.
For example, 30 minutes of clean speech sample data and 30 minutes of noisy speech sample data of the target domain, both of which can be understood as audio data, may be acquired.
Step 904: perform a short-time Fourier transform on the clean speech sample data and on the noisy speech sample data of the target domain respectively, to obtain a clean image corresponding to the clean speech sample data and a noisy image corresponding to the noisy speech sample data.
A short-time Fourier transform is applied to the clean speech sample data and to the noisy speech sample data of the target domain respectively, giving a clean speech magnitude spectrum corresponding to the clean speech sample data and a noisy speech magnitude spectrum corresponding to the noisy speech sample data; the clean speech magnitude spectrum is converted into the corresponding clean image, and the noisy speech magnitude spectrum is converted into the corresponding noisy image.
Step 906: input the clean image and the noisy image into the generative adversarial network for training to obtain the domain conversion model.
The discriminator loss, the generator loss and the feature region loss of the generative adversarial network are obtained from the clean image and the noisy image; the comprehensive loss of the generative adversarial network is determined from the discriminator loss, the generator loss and the feature region loss; the generative adversarial network is trained with the comprehensive loss, and the domain conversion model is obtained when the convergence condition is met.
Step 908: input clean speech data into the domain conversion model to obtain noisy speech data of the target domain.
Clean speech data is input into the domain conversion model, which is the trained generator network, and noisy speech data of the target domain is finally obtained, thereby enriching the original noisy speech sample data.
Step 910: perform volume normalization on the noisy speech data of the target domain to obtain first noisy speech data.
Volume normalization is applied to the noisy speech data of the target domain by means of a filter function, so that its volume is consistent with that of the clean speech sample data; that is, the first noisy speech data has a suitable, consistent volume.
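As a non-limiting illustration, one possible volume normalization is to match the RMS level of the generated noisy speech to that of the clean samples; the filter function mentioned above is not specified in this application, so the RMS matching below is an assumption:

```python
import numpy as np

def match_volume(noisy: np.ndarray, clean_ref: np.ndarray,
                 eps: float = 1e-8) -> np.ndarray:
    # Scale the generated noisy speech so its RMS level matches the clean reference
    rms_noisy = np.sqrt(np.mean(noisy ** 2) + eps)
    rms_clean = np.sqrt(np.mean(clean_ref ** 2) + eps)
    scaled = noisy * (rms_clean / rms_noisy)

    # Guard against clipping after rescaling
    peak = np.max(np.abs(scaled))
    return scaled / peak if peak > 1.0 else scaled
```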
Step 912: obtain transcription text corresponding to the first noisy speech data, and train an initial speech recognition model with the first noisy speech data and the corresponding transcription text to obtain a speech recognition model for the target domain.
The corresponding transcription text is obtained for the first noisy speech data, and the initial speech recognition model is trained with the first noisy speech data and the corresponding transcription text to obtain the speech recognition model of the target domain. The initial speech recognition model may be a Conformer (convolution-augmented Transformer), CTC (Connectionist Temporal Classification), RNN-T (RNN-Transducer) or U2++ style speech recognition model. The speech recognition model of the target domain can then accurately recognize speech data in the usage scene of the target domain.
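As a non-limiting illustration, a training step for a CTC-based speech recognition model on the first noisy speech data and its transcription text might be sketched as follows; the model, tokenization and tensor shapes are placeholders:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def asr_step(model: nn.Module, optimizer: torch.optim.Optimizer,
             feats: torch.Tensor, feat_lens: torch.Tensor,
             targets: torch.Tensor, target_lens: torch.Tensor) -> float:
    # feats: (batch, time, feat_dim); the model is assumed to return per-frame
    # scores over the vocabulary with shape (batch, time, vocab).
    log_probs = model(feats).log_softmax(dim=-1)
    # CTC expects (time, batch, vocab) log-probabilities
    loss = ctc_loss(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```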
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, there is no strict ordering restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may comprise multiple sub-steps or stages, which need not be performed at the same time but may be performed at different times, and whose execution order need not be sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a voice data processing apparatus for implementing the above voice data processing method. The implementation scheme of the apparatus is similar to that described for the method above, so for specific limitations in one or more embodiments of the voice data processing apparatus provided below, reference may be made to the limitations of the voice data processing method above, which are not repeated here.
In one embodiment, as shown in FIG. 10, a voice data processing apparatus is provided, comprising a sample acquisition module 1002, a preprocessing module 1004, a model training module 1006 and a data generation module 1008, wherein:
the sample acquisition module 1002 is configured to acquire clean speech sample data and noisy speech sample data of a target domain;
the preprocessing module 1004 is configured to preprocess the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data;
the model training module 1006 is configured to train a generative adversarial network with the noisy image and the clean image to obtain a domain conversion model; and
the data generation module 1008 is configured to input clean speech data into the domain conversion model to obtain noisy speech data of the target domain.
In one embodiment, the preprocessing module 1004 is further configured to:
perform frequency-domain conversion on the noisy speech sample data and the clean speech sample data respectively to obtain noisy speech frequency-domain data corresponding to the noisy speech sample data and clean speech frequency-domain data corresponding to the clean speech sample data;
process the clean speech frequency-domain data and the noisy speech frequency-domain data respectively to obtain a clean speech magnitude spectrum corresponding to the clean speech frequency-domain data and a noisy speech magnitude spectrum corresponding to the noisy speech frequency-domain data; and
obtain the noisy image from the noisy speech magnitude spectrum, and obtain the clean image from the clean speech magnitude spectrum.
In one embodiment, the model training module 1006 is further configured to:
determine a discriminator loss, a generator loss and a feature region loss from the noisy image and the clean image;
determine the comprehensive loss of the generative adversarial network from the discriminator loss, the generator loss and the feature region loss; and
train the generative adversarial network with the comprehensive loss, and obtain the domain conversion model when a convergence condition is met.
In one embodiment, the model training module 1006 includes:
a feature region selection unit, configured to select feature regions of a preset size on the noisy image to obtain a plurality of noisy feature regions corresponding to the noisy image, and correspondingly select feature regions of the preset size on the clean image to obtain a plurality of clean feature regions corresponding to the clean image;
a feature region generation unit, configured to input the noisy feature regions and the clean feature regions into the generator network of the generative adversarial network to obtain generated noisy features corresponding to the noisy feature regions and generated clean features corresponding to the clean feature regions; and
a feature region loss unit, configured to determine the feature region loss from the generated noisy features and the generated clean features.
In one embodiment, the feature region loss unit includes:
a sub-feature-region component, configured to take any one generated noisy feature as a reference value and compare each generated clean feature with the reference value for similarity, to obtain the sub-feature-region loss corresponding to that generated noisy feature; and
a feature region loss component, configured to obtain the feature region loss from the sub-feature-region losses corresponding to all generated noisy features.
In one embodiment, the sub-feature-region component is further configured to:
take any one generated noisy feature as the reference value, and compare each generated clean feature with the reference value for similarity;
when the position of a generated clean feature corresponds to that of the generated noisy feature, compare the similarity of the generated noisy feature and that generated clean feature to obtain a positive feature region loss;
when the position of a generated clean feature does not correspond to that of the generated noisy feature, compare the similarity of the generated noisy feature and that generated clean feature to obtain a negative feature region loss; and
obtain the sub-feature-region loss from the positive feature region loss and the negative feature region losses.
In one embodiment, the voice data processing apparatus further includes:
a normalization module, configured to perform volume normalization on the noisy speech data of the target domain to obtain first noisy speech data; and
a speech recognition model training module, configured to obtain transcription text corresponding to the first noisy speech data, and train an initial speech recognition model with the first noisy speech data and the corresponding transcription text to obtain a speech recognition model for the target domain.
Each module in the above voice data processing apparatus may be implemented wholly or partially in software, hardware or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server and whose internal structure may be as shown in FIG. 11. The computer device includes a processor, a memory and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store voice data. The network interface of the computer device is used to communicate with an external terminal over a network connection. The computer program, when executed by the processor, implements a voice data processing method.
Those skilled in the art will appreciate that the structure shown in FIG. 11 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, database or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory and the like. Volatile memory may include Random Access Memory (RAM), external cache memory and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc., without limitation.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application and are described in relative detail, but they should not be construed as limiting the scope of the present application. It should be noted that several variations and improvements can be made by those skilled in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of processing speech data, the method comprising:
acquiring clean speech sample data and noisy speech sample data in a target field;
preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data;
training a generative adversarial network according to the noisy image and the clean image to obtain a domain conversion model; and
inputting clean speech data into the domain conversion model to obtain noisy speech data in the target field.
2. The method according to claim 1, wherein preprocessing the noisy speech sample data and the clean speech sample data respectively to obtain the noisy image corresponding to the noisy speech sample data and the clean image corresponding to the clean speech sample data comprises:
performing frequency-domain conversion on the noisy speech sample data and the clean speech sample data respectively to obtain noisy speech frequency-domain data corresponding to the noisy speech sample data and clean speech frequency-domain data corresponding to the clean speech sample data;
processing the clean speech frequency-domain data and the noisy speech frequency-domain data respectively to obtain a clean speech magnitude spectrum corresponding to the clean speech frequency-domain data and a noisy speech magnitude spectrum corresponding to the noisy speech frequency-domain data; and
obtaining the noisy image according to the noisy speech magnitude spectrum, and obtaining the clean image according to the clean speech magnitude spectrum.
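For illustration only, the following is a minimal sketch of the preprocessing described in claim 2, assuming a short-time Fourier transform (STFT) as the frequency-domain conversion and a log-compressed, normalized magnitude spectrum as the image. The library choice (librosa/numpy), sampling rate, frame parameters, and file names are illustrative assumptions and are not specified by the claim.

```python
import numpy as np
import librosa


def speech_to_magnitude_image(wav_path, sr=16000, n_fft=512, hop_length=128):
    """Load a speech file and return its magnitude spectrum as a 2-D grey-scale image."""
    y, _ = librosa.load(wav_path, sr=sr)                        # time-domain samples
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)  # frequency-domain data
    mag = np.abs(spec)                                          # magnitude spectrum
    mag_db = librosa.amplitude_to_db(mag, ref=np.max)           # compress dynamic range
    # normalize to [0, 1] so the magnitude spectrum can be treated as an image
    return (mag_db - mag_db.min()) / (mag_db.max() - mag_db.min() + 1e-8)


# hypothetical file names, for illustration only:
# noisy_image = speech_to_magnitude_image("noisy_sample.wav")
# clean_image = speech_to_magnitude_image("clean_sample.wav")
```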
3. The method of claim 1, wherein training the generative adversarial network according to the noisy image and the clean image to obtain the domain conversion model comprises:
determining a discriminator loss, a generator loss, and a feature region loss according to the noisy image and the clean image;
determining a comprehensive loss of the generative adversarial network according to the discriminator loss, the generator loss, and the feature region loss; and
training the generative adversarial network according to the comprehensive loss, and obtaining the domain conversion model when a convergence condition is satisfied.
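As a non-limiting illustration of claim 3, the sketch below combines a discriminator loss, a generator loss, and the feature region loss of claims 4-6 into a comprehensive loss, assuming PyTorch and a standard binary cross-entropy GAN objective. The weighting factor and the particular loss forms are assumptions; the claim only requires that the three terms be combined. In practice the discriminator and generator terms are usually minimized in alternating steps rather than as a single sum.

```python
import torch
import torch.nn.functional as F


def comprehensive_loss(d_real, d_fake, feature_region_loss, lambda_feat=1.0):
    """d_real / d_fake: discriminator logits for real noisy images and for images
    produced by the generator; feature_region_loss: the loss of claims 4-6."""
    # discriminator loss: real noisy images should score 1, generated images 0
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    # generator loss: the generator tries to make its outputs score as real
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    # comprehensive loss combining the three terms; training continues until a
    # convergence condition is met, yielding the domain conversion model
    return d_loss + g_loss + lambda_feat * feature_region_loss
```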
4. The method of claim 3, wherein determining the feature region loss according to the noisy image and the clean image comprises:
selecting feature regions of a preset size from the noisy image to obtain a plurality of noisy feature regions corresponding to the noisy image, and correspondingly selecting feature regions of the preset size from the clean image to obtain a plurality of clean feature regions corresponding to the clean image;
inputting the noisy feature regions and the clean feature regions into a generator network of the generative adversarial network to obtain generated noisy features corresponding to the noisy feature regions and generated clean features corresponding to the clean feature regions; and
determining the feature region loss according to the generated noisy features and the generated clean features.
5. The method of claim 4, wherein determining the feature region loss according to the generated noisy features and the generated clean features comprises:
taking any one generated noisy feature as a reference value, and comparing the similarity of each generated clean feature with the reference value to obtain a sub-feature region loss corresponding to that generated noisy feature; and
obtaining the feature region loss according to the sub-feature region losses corresponding to all of the generated noisy features.
6. The method according to claim 5, wherein taking any one generated noisy feature as the reference value and comparing the similarity of each generated clean feature with the reference value to obtain the sub-feature region loss corresponding to that generated noisy feature comprises:
taking any one generated noisy feature as the reference value, and comparing the similarity of each generated clean feature with the reference value;
when the position of a generated clean feature corresponds to the position of the generated noisy feature, comparing the similarity of the generated noisy feature and that generated clean feature to obtain a positive feature region loss;
when the position of a generated clean feature does not correspond to the position of the generated noisy feature, comparing the similarity of the generated noisy feature and that generated clean feature to obtain a negative feature region loss; and
obtaining the sub-feature region loss according to the positive feature region loss and the negative feature region loss.
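For illustration only, the following is a minimal sketch of the feature region loss of claims 4-6, written as a patch-wise contrastive loss in PyTorch: position-matched (positive) pairs of generated noisy and generated clean features are pulled together, while mismatched (negative) pairs are pushed apart. The cosine similarity, the temperature, and the cross-entropy formulation are illustrative assumptions and are not fixed by the claims.

```python
import torch
import torch.nn.functional as F


def feature_region_loss(gen_noisy_feats, gen_clean_feats, temperature=0.07):
    """gen_noisy_feats / gen_clean_feats: (num_regions, feat_dim) tensors produced by
    the generator network from noisy / clean feature regions at the same positions."""
    noisy = F.normalize(gen_noisy_feats, dim=1)
    clean = F.normalize(gen_clean_feats, dim=1)
    # similarity of every generated noisy feature (the reference value) to every
    # generated clean feature
    logits = noisy @ clean.t() / temperature      # (num_regions, num_regions)
    # diagonal entries are position-matched (positive) pairs,
    # off-diagonal entries are mismatched (negative) pairs
    targets = torch.arange(noisy.size(0), device=noisy.device)
    # cross-entropy over each row yields the sub-feature region losses;
    # their mean is the feature region loss
    return F.cross_entropy(logits, targets)
```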
7. A speech data processing apparatus, characterized in that the apparatus comprises:
a sample acquisition module, configured to acquire clean speech sample data and noisy speech sample data in a target field;
a preprocessing module, configured to preprocess the noisy speech sample data and the clean speech sample data respectively to obtain a noisy image corresponding to the noisy speech sample data and a clean image corresponding to the clean speech sample data;
a model training module, configured to train a generative adversarial network according to the noisy image and the clean image to obtain a domain conversion model; and
a data generation module, configured to input clean speech data into the domain conversion model to obtain noisy speech data in the target field.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 6.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
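By way of example only, the sketch below illustrates the final step of claim 1: feeding clean speech through a trained domain conversion model to obtain noisy speech in the target field. The `generator` interface (magnitude spectrum in, magnitude spectrum out), the reuse of the clean signal's phase for the inverse STFT, and the omission of any normalization used during training are assumptions made purely for illustration.

```python
import numpy as np
import librosa


def clean_to_noisy(wav_path, generator, sr=16000, n_fft=512, hop_length=128):
    """Convert a clean speech file into target-field noisy speech using a trained
    domain conversion model (hypothetical `generator`: magnitude -> magnitude)."""
    y, _ = librosa.load(wav_path, sr=sr)
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    mag, phase = np.abs(spec), np.angle(spec)
    # the domain conversion model maps the clean magnitude spectrum to a
    # noisy-style magnitude spectrum (any scaling used in training is omitted here)
    noisy_mag = generator(mag)
    # reuse the clean phase to return to a time-domain waveform
    noisy_spec = noisy_mag * np.exp(1j * phase)
    return librosa.istft(noisy_spec, hop_length=hop_length)
```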
CN202111605751.2A 2021-12-25 2021-12-25 Voice data processing method and device, computer equipment and storage medium Withdrawn CN114299932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111605751.2A CN114299932A (en) 2021-12-25 2021-12-25 Voice data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111605751.2A CN114299932A (en) 2021-12-25 2021-12-25 Voice data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114299932A true CN114299932A (en) 2022-04-08

Family

ID=80969298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111605751.2A Withdrawn CN114299932A (en) 2021-12-25 2021-12-25 Voice data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114299932A (en)

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
Avila et al. Feature pooling of modulation spectrum features for improved speech emotion recognition in the wild
CN110503971A (en) Time-frequency mask neural network based estimation and Wave beam forming for speech processes
CN111954904A (en) Audio signal processing system and method for converting input audio signal
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
CN109658352A (en) Optimization method and device, electronic equipment and the storage medium of image information
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
Majumder et al. Few-shot audio-visual learning of environment acoustics
Zhao et al. Audio splicing detection and localization using environmental signature
CN111201569A (en) Electronic device and control method thereof
CN111128222B (en) Speech separation method, speech separation model training method, and computer-readable medium
Zhang et al. An end-to-end non-intrusive model for subjective and objective real-world speech assessment using a multi-task framework
CN112767927A (en) Method, device, terminal and storage medium for extracting voice features
EP3392882A1 (en) Method for processing an input audio signal and corresponding electronic device, non-transitory computer readable program product and computer readable storage medium
WO2022005615A1 (en) Speech enhancement
Chen et al. Sound localization by self-supervised time delay estimation
Parekh et al. Listen to interpret: Post-hoc interpretability for audio networks with nmf
Zheng et al. Noise-robust blind reverberation time estimation using noise-aware time–frequency masking
CN112289338B (en) Signal processing method and device, computer equipment and readable storage medium
CN116913304A (en) Real-time voice stream noise reduction method and device, computer equipment and storage medium
Somayazulu et al. Self-Supervised Visual Acoustic Matching
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
US20230136220A1 (en) Quantifying Signal Purity by means of Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20220408)