CN112420022A - Noise extraction method, device, equipment and storage medium

Noise extraction method, device, equipment and storage medium

Info

Publication number: CN112420022A (application number CN202011131906.9A); granted as CN112420022B
Authority: CN (China)
Prior art keywords: phoneme, label, voice, frame, speech
Legal status: Granted; Active
Inventors: 叶帅帅, 胡新辉, 徐欣康
Applicant and assignee: Zhejiang Tonghuashun Intelligent Technology Co Ltd
Other languages: Chinese (zh)

Classifications

    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Abstract

The invention discloses a noise extraction method, device, equipment and storage medium. The method comprises the following steps: acquiring the acoustic features of each speech frame in speech data; inputting the acoustic features into a first speech recognition model to obtain a first class label of each speech frame; inputting the acoustic features into a second speech recognition model to obtain a phoneme label of each speech frame; determining a second class label of each speech frame according to the phoneme label; fusing the first class label and the second class label to obtain a target label of each speech frame; and determining a noise segment according to the target label and extracting the noise segment. By fusing the recognition results of two neural networks to obtain the noise in the speech data, the method can improve the accuracy of noise extraction.

Description

Noise extraction method, device, equipment and storage medium
Technical Field
The embodiments of the invention relate to the technical field of speech, and in particular to a noise extraction method, device, equipment and storage medium.
Background
Speech technology plays a very important role in human-computer interaction as an important branch of Artificial Intelligence (AI). In order to improve the anti-noise performance and robustness of speech technologies such as speech recognition and voiceprint recognition in an actual speech application system, the enhancement of a training corpus by using a noise corpus is the most important and most common technical means.
In actual use, speech systems mostly rely on open-source noise data sets, which do not match the noise of the actual usage scene well, so the performance of speech technologies such as speech recognition and voiceprint recognition is often unsatisfactory. To further improve the performance of speech technology in real scenes, the key is to perform data augmentation with environmental noise data from the actual application scene, thereby improving the match between the training data and the test environment.
In the prior art, noise extraction is performed based on the result of traditional Voice Activity Detection (VAD), but this method often misjudges low-energy speech and high-energy noise, so the extracted noise may still contain speech segments.
Therefore, how to effectively extract noise from speech recorded in the actual environment is a technical problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the invention provides a noise extraction method, a noise extraction device, noise extraction equipment and a storage medium.
In a first aspect, an embodiment of the present invention provides a noise extraction method, including:
acquiring acoustic features of each voice frame in voice data;
inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame;
inputting the acoustic features into a second speech recognition model to obtain a phoneme label of each speech frame;
determining a second class label of each voice frame according to the phoneme label;
fusing the first class label and the second class label to obtain a target label of each voice frame;
and determining a noise section according to the target label, and extracting the noise section.
In a second aspect, an embodiment of the present invention further provides a noise extraction apparatus, including:
the acoustic characteristic acquisition module is used for acquiring the acoustic characteristics of each voice frame in the voice data;
the first class label acquisition module is used for inputting the acoustic characteristics into a first voice recognition model to acquire a first class label of each voice frame;
a phoneme label obtaining module, configured to input the acoustic features into a second speech recognition model, so as to obtain a phoneme label of each speech frame;
the second class label determining module is used for determining a second class label of each voice frame according to the phoneme label;
the label fusion module is used for fusing the first class label and the second class label to obtain a target label of each voice frame;
and the noise section extraction module is used for determining a noise section according to the target label and extracting the noise section.
In a third aspect, an embodiment of the present invention further provides a computer device, including:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the noise extraction method described in any embodiment of the present invention.
In a fourth aspect, the embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the noise extraction method provided in any embodiment of the present invention.
The embodiment of the invention provides a noise extraction method, a device, equipment and a storage medium, which comprises the steps of firstly obtaining the acoustic characteristics of each voice frame in voice data; secondly, inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame; then inputting the acoustic features into a second speech recognition model to obtain a phoneme label of each speech frame; then determining a second class label of each voice frame according to the phoneme label; fusing the first class label and the second class label to obtain a target label of each voice frame; and finally, determining a noise section according to the target label, and extracting the noise section. According to the noise extraction method disclosed by the embodiment, the recognition results of the two neural networks are fused to obtain the noise in the voice data, so that the accuracy of noise extraction can be improved.
Drawings
Fig. 1 is a schematic flow chart of a noise extraction method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a noise extraction method according to a second embodiment of the present invention;
fig. 3 is an overall flowchart of a noise extraction method according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a noise extraction device according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like. In addition, the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
The term "include" and variations thereof as used herein are intended to be open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment".
Example one
Fig. 1 is a flowchart of a noise extraction method according to an embodiment of the present invention, where the method is applicable to a case where a noise segment is extracted from speech in an actual environment, and the method may be executed by a noise extraction apparatus, where the apparatus may be implemented by software and/or hardware and is generally integrated on a computer device.
As shown in fig. 1, a noise extraction method provided in the first embodiment of the present invention includes the following steps:
s110, obtaining the acoustic characteristics of each voice frame in the voice data.
The acoustic feature may be Mel-Frequency Cepstral Coefficients (MFCCs). In the present embodiment, MFCCs are used because they are the speech feature parameters most commonly used in speech recognition. Alternatively, Perceptual Linear Prediction (PLP) or other features may also be used as the speech feature.
Illustratively, taking MFCCs as the acoustic feature, the computer device may frame the speech data and then extract the MFCCs from each speech frame. The most common and basic MFCC extraction steps in speech recognition can be summarized as follows: first frame the speech signal; perform a Fourier transform on each frame to obtain its spectrum; sum the spectral energy under each triangular Mel filter; take the logarithm of the result; and finally compute the cepstrum through an inverse transform (in practice usually a discrete cosine transform).
The result of the whole process expresses each speech frame as a 12-20 dimensional vector, so a whole piece of speech data can be expressed as a sequence of such vectors; it is these vectors and vector sequences that are modeled in speech recognition.
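As an illustration of the steps above, the following minimal sketch shows one way the frame-level MFCC features could be computed with the open-source librosa library; the file name, sampling rate, feature dimension and frame parameters are assumptions made for the example and are not prescribed by this embodiment.

```python
# Illustrative sketch only: frame-level MFCC extraction with librosa.
# The file name, sampling rate, and frame/hop sizes are assumed values.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)      # load the speech data
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                                    # one 13-dimensional vector per frame
    n_fft=int(0.025 * sr),                        # 25 ms analysis window
    hop_length=int(0.010 * sr),                   # 10 ms frame shift
)
frames = mfcc.T                                   # shape (num_frames, 13): one vector per speech frame
```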
S120, inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame.
In this embodiment, the first speech recognition model may be a network model based on Speech Activity Detection (SAD). Through the SAD network, each speech frame can be classified as speech or non-speech; that is, based on a deep neural network, the SAD model identifies whether each speech frame is a noise frame or a normal frame. The first speech recognition model is trained on a large amount of sample speech data.
The first class label may be the number 0 or 1, where 0 represents a noise frame and 1 represents a normal frame. The obtained acoustic features are input into the trained SAD network model to obtain the first class label corresponding to each speech frame.
For example, suppose the acoustic feature corresponding to a certain speech frame is input into the SAD model: if the output first class label is 1, the speech frame corresponding to that acoustic feature is a normal speech frame; if the output first class label is 0, it is a noise frame.
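A minimal sketch of this labeling step is given below, under stated assumptions: `sad_first_class_labels` and the dummy energy-based model are invented for illustration, and any trained SAD network returning a per-frame speech probability could take the dummy model's place.

```python
# Illustrative sketch only: obtaining the first class label (0 = noise frame,
# 1 = normal speech frame) for every speech frame from a frame classifier.
import numpy as np

def sad_first_class_labels(frames: np.ndarray, sad_model) -> np.ndarray:
    """frames: (num_frames, feat_dim) acoustic features; sad_model: any trained
    frame classifier returning P(speech) per frame (a stand-in for the SAD network)."""
    speech_prob = np.asarray(sad_model(frames))   # shape (num_frames,)
    return (speech_prob >= 0.5).astype(int)       # 1 = normal frame, 0 = noise frame

# Dummy model for illustration only: treats low-energy frames as noise.
dummy_model = lambda f: (np.abs(f).mean(axis=1) > 1.0).astype(float)
labels_sad = sad_first_class_labels(np.random.randn(100, 13), dummy_model)
```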
S130, inputting the acoustic features into a second speech recognition model to obtain the phoneme label of each speech frame.
In this embodiment, the second speech recognition model may be an acoustic model for Automatic Speech Recognition (ASR). Based on a deep neural network, the ASR acoustic model can recognize the phoneme label corresponding to each speech frame; the second speech recognition model is likewise trained on a large amount of sample speech data.
The phonetic unit corresponding to a phoneme label (hereinafter simply referred to as a phoneme) can be understood as the smallest unit underlying the acoustic features. A phoneme label can be represented by a natural number, with each value corresponding to one phoneme.
For example, if the phoneme information includes phonemes "m" and "i", the phoneme label of the phoneme "m" may be 20, and the phoneme label of the phoneme "i" may be 32.
Phoneme information can be understood as information describing the phoneme content. A phoneme is the smallest speech unit divided according to the natural attributes of speech; it can be analyzed from the pronunciation actions within a syllable, with one action constituting one phoneme. Illustratively, "ma" contains two pronunciation actions, "m" and "a", which are two phonemes. Sounds produced by the same pronunciation action are the same phoneme, and sounds produced by different pronunciation actions are different phonemes. For example, in "ma-mi", the two "m" pronunciation actions are the same and therefore the same phoneme, while the "a" and "i" actions differ and are different phonemes. Phonemes are generally analyzed and described in terms of pronunciation actions.
In this embodiment, after the acoustic features are input into the ASR acoustic model for recognition, the phoneme label of each speech frame can be output.
And S140, determining a second class label of each voice frame according to the phoneme label.
The second class label indicates whether a speech frame in the speech data is a noise frame: for example, if the current speech frame is a normal speech frame, it is marked as 1; if the current speech frame is a noise frame, it is marked as 0.
Specifically, the manner of determining the second class label of each speech frame according to the phoneme label may be: constructing a voice recognition decoding graph HCLG according to the text data and a preset pronunciation dictionary; decoding the phoneme label according to the HCLG to obtain a phoneme category; and determining a second class label of each voice frame according to the phoneme class.
The text data is data corresponding to the voice data, for example, for a song, the lyrics therein are the text data thereof. The phone categories may include silence phones, garbage phones, noise phones, and normal phones.
The preset pronunciation dictionary is established based on a large number of voice samples and comprises a corresponding relation between phonemes and phoneme labels. According to the preset pronunciation dictionary, each phoneme can be accurately corresponding to the phoneme label. In this embodiment, the construction of the speech recognition decoding graph HCLG according to the text data and the preset pronunciation dictionary may be implemented based on the prior art, and will not be described herein again.
In this embodiment, the second class label of each speech frame may be determined from the phoneme type as follows: if the phoneme type is a silence phoneme or a garbage phoneme, the second class label of the current speech frame is 0; if the phoneme type is a normal phoneme, the second class label of the current speech frame is 1.
Optionally, decoding the phoneme label according to the HCLG, and obtaining the phoneme category may be: integrating the phoneme labels according to the HCLG to obtain phoneme information; determining the phoneme type of phoneme information according to the type of each phoneme label in a preset pronunciation dictionary; matching the phoneme information with each voice frame according to the HCLG; and determining the phoneme type of each voice frame according to the matching result.
Wherein the phoneme information includes a plurality of phoneme labels. The category of the phoneme label may be understood as a category labeled for each phoneme label, for example: it is assumed that in the preset pronunciation dictionary, phoneme label 20 is labeled as noise, phoneme label 99 is labeled as silence, phoneme label 301 is labeled as garbage, and the like, and this is only an example and is not a limitation.
Specifically, the integration of the phoneme tags according to the HCLG and the matching of the phoneme information with each speech frame according to the HCLG can be realized by adopting the existing technology, and are not described herein again.
Specifically, the phoneme type of the phoneme information may be determined according to the type of each phoneme label in the preset pronunciation dictionary; if at least one of the phoneme labels included in the phoneme information corresponds to an abnormal phoneme, the phoneme type of the phoneme information is determined to be abnormal. For example, if a piece of phoneme information includes phoneme label 20, the phoneme type of that phoneme information is noise.
Specifically, after matching the phoneme information with the speech frames, the phoneme class of each speech frame can be obtained.
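The mapping from phoneme category to second class label can be sketched as follows; the category names mirror those used in this description (with noise phonemes also treated as non-speech, as in the later embodiments), and the function name is an assumption for the example.

```python
# Illustrative sketch only: deriving the second class label of a speech frame
# from its phoneme category (0 = noise frame, 1 = normal speech frame).
NON_SPEECH_CATEGORIES = {"silence", "garbage", "noise"}

def second_class_label(phoneme_category: str) -> int:
    """Return 1 for a normal phoneme, 0 for silence/garbage/noise phonemes."""
    return 0 if phoneme_category in NON_SPEECH_CATEGORIES else 1

labels_asr = [second_class_label(c) for c in ["normal", "silence", "noise", "normal"]]
print(labels_asr)   # [1, 0, 0, 1]
```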
S150, fusing the first class label and the second class label to obtain a target label of each voice frame.
In this embodiment, the first class label and the second class label are combined by a frame-by-frame bitwise OR operation to obtain the target label of each speech frame. For example, if the first class label of a speech frame is 1 and its second class label is 0, the target label of that frame after the OR operation is 1.
The rule of the bitwise OR operation is that if either the first class label or the second class label has the value 1, the target label is determined to be 1; only when the first class label and the second class label are both 0 is the target label determined to be 0.
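A minimal sketch of the fusion step, assuming the two label sequences are 0/1 integer arrays of the same length:

```python
# Illustrative sketch only: frame-by-frame bitwise OR fusion of the two label sequences.
import numpy as np

def fuse_labels(first_class: np.ndarray, second_class: np.ndarray) -> np.ndarray:
    """The target label is 0 only when both models mark the frame as a noise frame (0)."""
    return np.bitwise_or(first_class.astype(int), second_class.astype(int))

target = fuse_labels(np.array([1, 0, 0, 1]), np.array([0, 0, 1, 1]))
print(target)   # [1 0 1 1] -> only the second frame is kept as a noise frame
```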
And S160, determining a noise section according to the target label, and extracting the noise section.
In this embodiment, a noise segment is speech data formed by a plurality of consecutive noise frames. Specifically, the noise segment may be determined from the target labels as follows: a speech segment formed by consecutive speech frames whose target labels are continuously 0, and whose number is not less than a first set value, is determined to be a noise segment.
The first set value may be any value equal to or greater than 2. For example, if the target labels of a plurality of consecutive speech frames are all 0, the speech segment formed by those consecutive frames is determined to be a noise segment.
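The extraction of noise segments as runs of consecutive noise frames can be sketched as follows; `min_frames` plays the role of the first set value, and its default of 2 is an assumption for the example.

```python
# Illustrative sketch only: noise segments are runs of consecutive frames whose target
# label is 0, kept only if the run contains at least `min_frames` frames.
def find_noise_segments(target_labels, min_frames=2):
    segments, start = [], None
    for i, lab in enumerate(list(target_labels) + [1]):   # sentinel 1 closes a trailing run
        if lab == 0 and start is None:
            start = i                                      # a run of noise frames begins
        elif lab != 0 and start is not None:
            if i - start >= min_frames:
                segments.append((start, i))                # frame-index range [start, i)
            start = None
    return segments

print(find_noise_segments([1, 0, 0, 0, 1, 0, 1]))          # [(1, 4)]
```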
The noise extraction method provided by this embodiment of the invention first acquires the acoustic features of each speech frame in the speech data; it then inputs the acoustic features into a first speech recognition model to obtain a first class label of each speech frame, and inputs the acoustic features into a second speech recognition model to obtain a phoneme label of each speech frame; a second class label of each speech frame is then determined according to the phoneme label; the first class label and the second class label are fused to obtain a target label of each speech frame; and finally a noise segment is determined according to the target label and extracted. By fusing the recognition results of the two neural networks to obtain the noise in the speech data, the method can improve the accuracy of noise extraction.
Example two
Fig. 2 is a schematic flow chart of a noise extraction method according to a second embodiment of the present invention, where the second embodiment is optimized based on the foregoing embodiments, and further, acquiring an acoustic feature of each speech frame in speech data specifically includes: performing framing processing on voice data to obtain a plurality of voice frames; and extracting acoustic features of the plurality of speech frames.
Further, determining a second class label of each speech frame according to the phoneme label specifically includes: constructing a voice recognition decoding graph HCLG according to the text data and a preset pronunciation dictionary; the text data is data corresponding to the voice data; decoding the phoneme label according to the HCLG to obtain a phoneme category; and determining a second class label of each voice frame according to the phoneme class.
Further, fusing the first category label and the second category label to obtain a target label of each voice frame, including: and carrying out bit or operation on the first class label and the second class label to obtain a target label of each voice frame.
Further, determining a noise segment according to the target tag includes: and determining a voice section formed by voice frames with target labels continuously being the first set value as a noise section.
Correspondingly, as shown in fig. 2, a noise extraction method provided by the second embodiment of the present invention includes the following steps:
s210, performing framing processing on the voice data to obtain a plurality of voice frames.
The speech data may be speech data from an actual environment. Framing divides the speech signal into segments so that its characteristic parameters can be analyzed; each segment is called a "frame". The frame length is generally 10-30 ms, i.e., the length of one frame should be less than the length of one phoneme.
In this embodiment, the frame length may optionally be 25 ms with a frame shift of 10 ms, i.e., a new frame is taken every 10 ms. After this processing, the audio is split into frames, and each short segment of waveform is converted into multi-dimensional vector information according to the characteristics of human hearing.
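A minimal framing sketch under these assumed parameters (16 kHz sampling rate, 25 ms frames, 10 ms shift) is given below:

```python
# Illustrative sketch only: splitting speech samples into 25 ms frames with a 10 ms shift.
import numpy as np

def frame_signal(samples: np.ndarray, sr: int = 16000,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    frame_len = int(sr * frame_ms / 1000)          # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)              # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(samples) - frame_len) // shift)
    return np.stack([samples[i * shift: i * shift + frame_len] for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))      # one second of dummy audio
print(frames.shape)                                # (98, 400)
```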
S220, extracting the acoustic features of the voice frames.
The voice frames and the acoustic features are in one-to-one correspondence, namely one voice frame corresponds to one acoustic feature, and the acoustic features of the voice frames in all voice data can be completely extracted through the method.
Taking MFCCs as the acoustic feature, extracting the MFCCs of a speech frame may include: first applying pre-emphasis to the speech data by passing the speech signal through a high-pass filter; then framing and windowing the speech data; then performing a fast Fourier transform; then passing the spectrum through a Mel filter bank and applying a discrete cosine transform to the log filter-bank energies; and finally extracting the dynamic difference (delta) parameters.
S230, inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame.
S240, inputting the acoustic features into a second speech recognition model to obtain the phoneme label of each speech frame.
And S250, constructing a voice recognition decoding graph HCLG according to the text data and a preset pronunciation dictionary.
The text data is data corresponding to the voice data.
In this embodiment, the speech recognition decoding graph HCLG needs to be constructed in the speech recognition process.
The preset pronunciation dictionary is constructed by text data and comprises a plurality of phoneme labels and categories of the phoneme labels.
S260, decoding the phoneme label according to the HCLG to obtain a phoneme category.
Specifically, decoding the phoneme label according to the HCLG, and obtaining the phoneme category may be: integrating the phoneme labels according to the HCLG to obtain phoneme information; the phoneme information comprises a plurality of phoneme labels; determining the phoneme type of phoneme information according to the type of each phoneme label in a preset pronunciation dictionary; matching the phoneme information with each voice frame according to the HCLG; and determining the phoneme type of each voice frame according to the matching result.
One piece of phoneme information may include a plurality of phoneme tags, for example, the phoneme information is "wo", which includes phonemes "w" and "o", the phoneme tag corresponding to "w" is "110", the phoneme tag corresponding to "o" is "099", and the phoneme tag corresponding to "wo" is "110099".
Integrating the phoneme labels according to the HCLG to obtain the phoneme information can be understood as the HCLG integrating the multiple phoneme labels corresponding to a piece of phoneme information, which can be done in several ways. Illustratively, the first way is to insert new phoneme labels among the phoneme labels corresponding to the phoneme information; the second way is to correct wrong phoneme labels among the phoneme labels corresponding to the phoneme information. Integrating the labels in these ways makes the phoneme information correspond to the correct phoneme labels. The two ways may also be applied simultaneously.
The category of a phoneme label can be obtained by looking up the preset pronunciation dictionary and may include, for example, silence phonemes, garbage phonemes, normal phonemes, and the like. The phoneme category can be understood as classifying the phoneme information into different categories according to the categories of its phoneme labels; the phoneme category may likewise include silence phonemes, garbage phonemes, normal phonemes, and so on. For example, if the categories of the phoneme labels of a piece of phoneme information include a silence phoneme, represented by the numeral 1, the phoneme category of that phoneme information can be determined to be silence by looking up the preset pronunciation dictionary.
The matching principle can be that one phoneme information can correspond to a plurality of speech frames, and the number of the speech frames corresponding to one phoneme information can be determined by the HCLG.
In step S260, the phoneme information may be associated with the speech frame, and after the phoneme type of the phoneme information is determined, the phoneme type of each speech frame may be determined according to the matching result.
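To illustrate this matching, the sketch below expands per-phoneme categories to a frame-level category sequence once the number of speech frames spanned by each phoneme is known from the HCLG decoding; the inputs are assumed example values, not the output format of any particular decoder.

```python
# Illustrative sketch only: expanding phoneme-level categories to frame-level categories
# after decoding has matched each piece of phoneme information to its speech frames.
def frame_level_categories(phoneme_categories, frames_per_phoneme):
    """phoneme_categories: e.g. ["silence", "normal", "noise"];
    frames_per_phoneme: how many speech frames each phoneme spans."""
    frame_cats = []
    for cat, n in zip(phoneme_categories, frames_per_phoneme):
        frame_cats.extend([cat] * n)               # every frame inherits its phoneme's category
    return frame_cats

print(frame_level_categories(["silence", "normal", "noise"], [3, 5, 2]))
```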
S270, determining a second class label of each voice frame according to the phoneme class.
The phoneme classes may include silence phonemes, garbage phonemes, noise phonemes, and normal phonemes. The values of the second class labels may include 0 and 1.
For example, if the phoneme type is silence phoneme, garbage phoneme, or noise phoneme, the second type label is 0, and if the phoneme type is normal phoneme, the second type label is 1.
S280, carrying out bit or operation on the first class label and the second class label to obtain a target label of each voice frame.
Specifically, the bitwise OR operation on the first class label and the second class label covers the following cases: if the first class label and the second class label of the current speech frame are both 0, the target label of the frame after the OR operation is 0; if the first class label and the second class label of the current speech frame are both 1, the target label is 1; and if one of the two labels is 0 and the other is 1, the target label of the frame after the OR operation is 1.
S290, determining a voice section formed by the voice frames with the target labels continuously being the first set value as a noise section.
The first set value may be a specific preset value. For example, the first set value may be any value equal to or greater than 2, meaning that a speech segment formed by two or more consecutive speech frames whose target labels are 0 is determined to be a noise segment.
The noise extraction method provided by the second embodiment of the invention details the process of acquiring the acoustic features of each speech frame in the speech data and of determining the second class label of each speech frame according to the phoneme label, and further details fusing the first class label and the second class label to obtain the target label of each speech frame and determining the noise segment according to the target label. In this method, the SAD model can effectively distinguish noise frames from normal speech frames, the ASR acoustic model can effectively judge noise frames according to the phoneme type, and fusing the two allows noise segments in the actual speech environment to be identified accurately.
Further, the training process of the first speech recognition model is as follows: acquiring acoustic characteristics of each voice frame in sample voice data and a first class label of each voice frame; and training a first speech recognition model based on a first training data pair formed by the acoustic features and the first class labels.
Wherein the sample speech data is acquired in a dataset and used as a training set for the SAD network model. The first class label is used to characterize whether the speech frame is a normal frame or a noise frame.
Specifically, firstly, frame division processing is performed on sample voice data, acoustic features of voice frames are extracted, and the acoustic features are used as input of an SAD network model. And constructing the acoustic features and the first class labels corresponding to the acoustic features into a first training data pair, and training the SAD network model through the constructed first training data pair.
Training the first speech recognition model may be to label the speech frames according to the correspondence in the first training data pair, wherein the SAD network model resembles a classifier. The training process is a process of continuously optimizing the SAD network model. Illustratively, the noise frame in the sample speech data may be labeled as 0 and the speech frame may be labeled as 1.
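A minimal training sketch is shown below, assuming a small PyTorch feed-forward classifier as a stand-in for the SAD network; the architecture, feature dimension, training data and hyper-parameters are assumptions for illustration and are not the network actually used in this embodiment.

```python
# Illustrative sketch only: training a frame-level noise/speech classifier on
# first training data pairs (MFCC feature, first class label).
import torch
import torch.nn as nn

sad_model = nn.Sequential(                 # stand-in for the SAD deep neural network
    nn.Linear(13, 64), nn.ReLU(),
    nn.Linear(64, 2),                      # two classes: 0 = noise frame, 1 = normal frame
)
optimizer = torch.optim.Adam(sad_model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

features = torch.randn(1000, 13)           # placeholder acoustic features of sample speech frames
labels = torch.randint(0, 2, (1000,))      # placeholder frame-level first class labels

for epoch in range(5):                     # the training loop continuously optimizes the model
    optimizer.zero_grad()
    loss = criterion(sad_model(features), labels)
    loss.backward()
    optimizer.step()
```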
Further, the training process of the second speech recognition model is as follows: analyzing the sample text data to obtain phoneme information of each voice frame; the sample text data is data corresponding to the sample voice data; determining a phoneme label corresponding to the phoneme information; training the second speech recognition model based on a second training data pair consisting of the acoustic features and the phoneme labels.
The sample text data may be understood as text data corresponding to the sample speech data, for example, the sample text data may be text data corresponding to the sample speech data input to the SAD network model for training.
The second training data pair may be a pair of data in which an acoustic feature of the sample speech data and a phoneme label of the sample speech data are combined.
Specifically, the obtained sample speech data is first framed into a plurality of speech frames; the sample text data is then parsed into phoneme information and the phoneme label corresponding to the phoneme information is determined; next, the acoustic features of a speech frame and the phoneme label of that frame are constructed into a second training data pair; finally, the ASR acoustic model is trained and continuously optimized so that an accurate correspondence is established between the input acoustic features and the output phoneme labels.
Wherein, the acoustic features are input of an ASR acoustic model, and the output of the ASR model is a phoneme label. Wherein the sample speech data may be training set data for an ASR acoustic model.
Parsing the sample text data into phoneme information can be completed through a phoneme information table, which may contain all the phoneme information: the content of the text data is split into words, and each word is then mapped to its phoneme information. Illustratively, according to the phoneme information table, the word "good" (Chinese "hao") corresponds to the phoneme information "hao", which contains the phonemes "h", "a" and "o".
The phoneme label corresponding to the phoneme information can be determined according to the preset pronunciation dictionary: the preset pronunciation dictionary matches each piece of phoneme information to the corresponding phoneme label according to its pronunciation, and different phoneme information corresponds to different labels.
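A minimal sketch of this lookup is given below; the phoneme information table and the numeric phoneme labels are made-up example entries, not contents of the actual preset pronunciation dictionary.

```python
# Illustrative sketch only: parsing text into phonemes with a phoneme information table,
# then mapping each phoneme to its numeric label with a preset pronunciation dictionary.
phoneme_table = {"good": ["h", "a", "o"]}          # word -> phoneme information (example entry)
pronunciation_dict = {"h": 11, "a": 2, "o": 7}     # phoneme -> phoneme label (example entries)

def text_to_phoneme_labels(words):
    labels = []
    for word in words:
        for phoneme in phoneme_table[word]:
            labels.append(pronunciation_dict[phoneme])
    return labels

print(text_to_phoneme_labels(["good"]))            # [11, 2, 7]
```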
Based on the second training data pair formed by the acoustic features and the phoneme labels, the acoustic features of the speech frame and the phoneme labels corresponding to the phoneme information of the speech frame are combined into a data pair, and the data pair can reflect the corresponding relation between the acoustic features and the phoneme labels.
The acoustic features of the speech frame are input in the ASR acoustic model, the output result may be a phoneme label corresponding to the input acoustic features, and then the output phoneme label is verified according to the phoneme label in the second training data pair.
EXAMPLE III
Fig. 3 is an overall flowchart of a noise extraction method according to a third embodiment of the present invention, and as can be seen from fig. 3, the noise extraction method according to the third embodiment of the present invention is a noise extraction method based on SAD and ASR.
The implementation of the technical scheme of this embodiment is mainly divided into two parts. The first part trains the SAD network model and the ASR acoustic model: the SAD model is similar to a classifier, and an input speech frame can be labeled directly with the SAD model, i.e., the first class label is obtained; the ASR acoustic model decodes the input speech frame into phonemes, i.e., phoneme information, and the speech frame is then labeled by judging whether the phonemes are valid phonemes, i.e., the second class label is obtained. The second part extracts noise with the SAD network model and ASR acoustic model trained in the first part, fusing the output of the SAD network model and the output of the ASR acoustic model through a multi-model fusion mechanism.
As shown in fig. 3, the entire noise extraction process can be divided into two major stages, including a model training stage and a noise extraction stage, where the model training stage can include SAD model training and ASR acoustic model training. The noise extraction stage may comprise the steps of:
After model training, test speech can be input into the SAD model and the ASR acoustic model respectively for testing to obtain the test results: the output of the SAD model test is the first class label, and the output of the ASR acoustic model test is the second class label. The first class label and the second class label are then combined by an OR operation to obtain the target label, the corresponding noise frames are extracted according to the target label, and the speech segment corresponding to those noise frames is the noise segment.
Illustratively, the SAD model training phase may include: the SAD training set is input into an SAD model for training, wherein the training data of the SAD training set can comprise a voice file, namely sample voice data, and a label of a corresponding frame level, namely a first class label.
Specifically, the SAD model training phase may include the following steps:
1. and performing framing processing on the sample voice data, and extracting the acoustic characteristics of the voice frame as the input of the SAD model.
2. A training data pair, i.e., a first training data pair, is constructed using the MFCCs and their voice frame labels, i.e., first class labels.
3. And carrying out SAD model training to obtain the final network parameters of the SAD model. I.e. the first speech recognition model is trained.
Wherein the final network parameters may be parameters of a SAD-CNN, SAD-DNN, or SAD-PDN network model. It should be noted that after the SAD model is trained, the model is continuously optimized to obtain the final optimal network parameters, and the network parameters can be used to test the input voice data in the SAD model testing stage.
For example, the ASR acoustic model training phase may include: inputting a training set of the ASR acoustic model into the ASR acoustic model for training, where training data of the training set of the ASR acoustic model may include a speech file, i.e., sample speech data, and a text file corresponding to the speech file, i.e., sample text data.
Specifically, the ASR acoustic model training phase may include the steps of:
step 1, framing processing is carried out on sample voice data, and MFCC of a voice frame is extracted to serve as input of an ASR acoustic model.
And 2, analyzing the sample text file into phonemes, and corresponding to phoneme numerical labels, namely phoneme labels, according to the pronunciation dictionary. Namely determining the phoneme label corresponding to the phoneme information.
And 3, constructing a training data pair by using the MFCC acoustic features and the phoneme numerical labels of the voice frames. I.e. a second training data pair based on the acoustic features and the phoneme labels.
And 4, training an ASR acoustic model to obtain the final network parameters of the model. I.e. the second speech recognition model is trained.
Wherein, the final network parameter can be ASR-CNN, ASR-DNN or ASR-PDN network model parameter. It should be noted that after the ASR acoustic model is trained, the model is continuously optimized to obtain the final optimal network parameters, and the network parameters can be used to test the input speech data in the ASR acoustic model testing stage.
Illustratively, the noise extraction stage may specifically include the following steps:
step 1, performing framing processing on voice data, and extracting MFCC acoustic features of voice frames. The voice data can be understood as test voice data.
And 2, inputting the MFCC acoustic features into a trained SAD model, and outputting labels of the voice frames, namely first class labels, wherein 0 can represent noise frames, and 1 can represent voice frames.
And 3, constructing a voice recognition decoding graph HCLG by using the text data corresponding to the voice data.
And 4, inputting the extracted MFCC acoustic features into a trained ASR acoustic model, and decoding the phoneme type of the output speech frame according to the obtained HCLG.
And 5, judging the class of the voice frame according to the class of the phonemes obtained by decoding, if the class of the voice frame is a mute phoneme or a garbage phoneme, marking the voice frame as 0, otherwise, marking the voice frame as 1, wherein 0 represents a noise frame and 1 represents the voice frame. Decoding the phoneme label according to the HCLG to obtain a phoneme category; and determining a second class label of each voice frame according to the phoneme class.
And 6, performing a bitwise OR operation frame by frame on the output results obtained in step 2 and step 5 to obtain the final speech frame labels, and extracting the speech frames that are continuously 0 to obtain the noise segment. That is, the first class label and the second class label are subjected to a bitwise OR operation to obtain the target label of each speech frame, and a speech segment formed by consecutive speech frames whose target labels are 0, and whose number reaches the first set value, is determined as the noise segment.
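Putting the noise extraction stage together, a sketch of the whole pipeline is given below under stated assumptions: `extract_mfcc`, `sad_model` and `asr_second_labels` are hypothetical callables standing for the feature extractor, the trained SAD model and the ASR-plus-HCLG decoding stage; they are not APIs defined by this embodiment.

```python
# Illustrative end-to-end sketch only: steps 1-6 of the noise extraction stage.
import numpy as np

def extract_noise_segments(samples, extract_mfcc, sad_model, asr_second_labels, min_frames=2):
    feats = extract_mfcc(samples)                             # step 1: frame-level MFCC features
    first = np.asarray(sad_model(feats), dtype=int)           # step 2: first class labels (0/1)
    second = np.asarray(asr_second_labels(feats), dtype=int)  # steps 3-5: second class labels (0/1)
    target = np.bitwise_or(first, second)                     # step 6: frame-by-frame bitwise OR
    segments, start = [], None
    for i, lab in enumerate(target.tolist() + [1]):           # sentinel 1 closes a trailing run
        if lab == 0 and start is None:
            start = i
        elif lab != 0 and start is not None:
            if i - start >= min_frames:
                segments.append((start, i))                   # frame-index range of one noise segment
            start = None
    return segments
```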
In order to further improve the performance of speech technology in actual scenes, data augmentation during model training cannot rely only on open-source noise data sets; more importantly, data augmentation should use environmental noise data from the actual application scene, so as to improve the match between the training data and the test environment. To extract noise segments from speech data in the actual environment, the past practice has been to perform noise extraction based on the result of Voice Activity Detection (VAD). However, this method often misjudges low-energy speech and high-energy noise, so the extracted noise may also contain speech segments. To address this issue, this embodiment uses a multi-model fusion mechanism that fuses the SAD model and the ASR acoustic model to extract the noise of the real environment from the test data set, thereby improving the accuracy of noise extraction.
In the noise extraction method provided by this embodiment, firstly, the SAD technology is applied to the field of noise extraction; compared with the prior art, in which Voice Activity Detection (VAD) is used to extract noise, the model-trained SAD approach performs better. Secondly, the ASR technology is applied to the field of noise extraction, and whether a frame is a speech frame or a noise frame can be judged according to the attributes of its phonemes. Thirdly, a multi-model fusion mechanism is introduced into the noise extraction field: the SAD model output and the ASR acoustic model decoding output are subjected to a bitwise OR operation, so that noise frames can be identified accurately.
Example four
Fig. 4 is a schematic structural diagram of a noise extraction apparatus according to a fourth embodiment of the present invention, which can be applied to the case of extracting a noise segment from speech in an actual environment, where the apparatus can be implemented by software and/or hardware and is generally integrated on a computer device.
As shown in fig. 4, the apparatus includes:
an acoustic feature obtaining module 410, configured to obtain an acoustic feature of each voice frame in the voice data;
a first class label obtaining module 420, configured to input the acoustic feature into a first speech recognition model, so as to obtain a first class label of each speech frame;
a phoneme label obtaining module 430, configured to input the acoustic features into the second speech recognition model, so as to obtain a phoneme label of each speech frame;
a second class label determining module 440, configured to determine a second class label of each speech frame according to the phoneme label;
a tag fusion module 450, configured to fuse the first class tag and the second class tag to obtain a target tag of each speech frame;
and a noise segment extracting module 460, configured to determine a noise segment according to the target tag, and extract the noise segment.
In this embodiment, the apparatus first uses an acoustic feature obtaining module to obtain acoustic features of each speech frame in speech data; then, a first class label obtaining module is used for inputting the acoustic characteristics into a first voice recognition model to obtain a first class label of each voice frame; secondly, a phoneme label obtaining module is used for inputting the acoustic features into a second speech recognition model to obtain a phoneme label of each speech frame; then a second class label determining module is used for determining a second class label of each voice frame according to the phoneme label; then, a label fusion module is used for fusing the first class label and the second class label to obtain a target label of each voice frame; and finally, a noise section extraction module is used for determining a noise section according to the target label and extracting the noise section.
This embodiment provides a noise extraction device, which can improve the accuracy of noise extraction by fusing the recognition results of two neural networks to obtain the noise in the speech data.
Further, the acoustic feature obtaining module 410 is specifically configured to: performing framing processing on voice data to obtain a plurality of voice frames; and extracting acoustic features of the plurality of speech frames.
Further, the second category label determining module 440 is specifically configured to construct a speech recognition decoding graph HCLG according to the text data and a preset pronunciation dictionary; the text data is data corresponding to the voice data; decoding the phoneme label according to the HCLG to obtain a phoneme category; and determining a second class label of each voice frame according to the phoneme class.
Further, a decoding module is configured to integrate the phoneme labels according to the HCLG to obtain phoneme information, the phoneme information comprising a plurality of phoneme labels; determine the phoneme type of the phoneme information according to the type of each phoneme label in the pronunciation dictionary; match the phoneme information with each voice frame according to the HCLG; and determine the phoneme type of each voice frame according to the matching result.
Further, the label fusion module 450 is specifically configured to perform a bitwise OR operation on the first class label and the second class label to obtain the target label of each voice frame.
Further, the noise segment extracting module 460 is further configured to determine a voice segment formed by the voice frame with the target tag being continuously the first setting value as the noise segment.
Further, the first training module is used for acquiring acoustic characteristics of each voice frame in the sample voice data and a first class label of each voice frame; and training a first speech recognition model based on a first training data pair formed by the acoustic features and the first class labels.
Further, the second training module is used for analyzing the sample text data to obtain phoneme information of each voice frame; the sample text data is data corresponding to the sample voice data; determining a phoneme numerical value corresponding to the phoneme information; training the second speech recognition model based on a second training data pair consisting of the acoustic features and the phoneme numerical values.
The noise extraction device can execute the noise extraction method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. As shown in fig. 5, a computer device provided in the fifth embodiment of the present invention includes: one or more processors 51 and storage 52; the processor 51 in the computer device may be one or more, and fig. 5 illustrates one processor 51 as an example; the storage 52 is used to store one or more programs; the one or more programs are executed by the one or more processors 51, so that the one or more processors 51 implement the noise extraction method according to any one of the embodiments of the present invention.
The computer device may further include: an input device 53 and an output device 54.
The processor 51, the storage means 52, the input means 53 and the output means 54 in the computer apparatus may be connected by a bus or other means, which is exemplified in fig. 5.
The storage device 52 in the computer device, as a computer-readable storage medium, is used to store one or more programs, which may be software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the noise extraction method according to the first or second embodiment of the present invention (for example, the modules of the noise extraction device shown in fig. 4: the acoustic feature obtaining module 410, the first class label obtaining module 420, the phoneme label obtaining module 430, the second class label determining module 440, the label fusion module 450 and the noise segment extracting module 460). The processor 51 executes the various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the storage device 52, that is, implements the noise extraction method in the above-described method embodiments.
The storage device 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the storage 52 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage 52 may further include memory located remotely from the processor 51, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 53 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the computer apparatus. The output device 54 may include a display device such as a display screen.
And, when one or more programs included in the above-mentioned computer apparatus are executed by the one or more processors 51, the programs perform the following operations:
acquiring acoustic characteristics of each voice frame in voice data;
inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame;
inputting the acoustic features into a second speech recognition model to obtain a phoneme label of each speech frame;
determining a second class label of each voice frame according to the phoneme label;
fusing the first class label and the second class label to obtain a target label of each voice frame;
and determining a noise section according to the target label, and extracting the noise section.
EXAMPLE six
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, the computer program being executed by a processor to perform a noise extraction method, the method including:
acquiring acoustic characteristics of each voice frame in voice data;
inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame;
inputting the acoustic features into a second speech recognition model to obtain a phoneme label of each speech frame;
determining a second class label of each voice frame according to the phoneme label;
fusing the first class label and the second class label to obtain a target label of each voice frame;
and determining a noise section according to the target label, and extracting the noise section.
Optionally, the program, when executed by the processor, may be further configured to perform a noise extraction method provided in any of the embodiments of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a flash Memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. A computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take a variety of forms, including, but not limited to: an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

1. A method of noise extraction, comprising:
acquiring acoustic features of each speech frame in speech data;
inputting the acoustic features into a first speech recognition model to obtain a first class label of each speech frame;
inputting the acoustic features into a second speech recognition model to obtain a phoneme label of each speech frame;
determining a second class label of each speech frame according to the phoneme label;
fusing the first class label and the second class label to obtain a target label of each speech frame;
and determining a noise segment according to the target label, and extracting the noise segment.
2. The method of claim 1, wherein acquiring the acoustic features of each speech frame in the speech data comprises:
performing framing processing on the speech data to obtain a plurality of speech frames;
and extracting acoustic features of the plurality of speech frames.
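As a purely illustrative sketch of the framing step in claim 2, the waveform can be cut into overlapping fixed-length frames before acoustic features (for example MFCC or log-mel filter-bank features) are computed for each frame; the 25 ms frame length, 10 ms hop and the log-energy stand-in feature below are assumptions chosen only to keep the example short:

    import numpy as np

    def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
        # cut a 1-D waveform into overlapping frames of frame_ms length every hop_ms
        frame_len = int(sample_rate * frame_ms / 1000)
        hop_len = int(sample_rate * hop_ms / 1000)
        n_frames = max(0, 1 + (len(signal) - frame_len) // hop_len)
        if n_frames == 0:
            return np.empty((0, frame_len))
        return np.stack([signal[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)])

    def log_energy(frames):
        # stand-in for a real per-frame acoustic feature such as MFCCs
        return np.log(np.sum(frames.astype(float) ** 2, axis=1) + 1e-10)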
3. The method of claim 1, wherein determining the second class label for each speech frame based on the phoneme label comprises:
constructing a speech recognition decoding graph HCLG according to text data and a preset pronunciation dictionary, wherein the text data is data corresponding to the speech data;
decoding the phoneme label according to the HCLG to obtain a phoneme class;
and determining the second class label of each speech frame according to the phoneme class.
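For claim 3, the decoding graph HCLG is conventionally composed from an HMM topology (H), a context-dependency transducer (C), a lexicon (L) and a grammar (G), normally with WFST tooling such as Kaldi or OpenFst rather than hand-written code. The small sketch below only illustrates the lexicon-side input, expanding the transcript text into the phone sequence the graph would constrain decoding to; the dictionary entries and the spoken-noise fallback phone are invented for the example:

    # hypothetical pronunciation dictionary: word -> phone sequence
    LEXICON = {
        "hello": ["hh", "ah", "l", "ow"],
        "world": ["w", "er", "l", "d"],
    }

    def text_to_phone_sequence(text, lexicon, unknown=("spn",)):
        phones = []
        for word in text.lower().split():
            phones.extend(lexicon.get(word, list(unknown)))  # unknown words fall back to a noise phone
        return phones

    print(text_to_phone_sequence("hello world", LEXICON))
    # -> ['hh', 'ah', 'l', 'ow', 'w', 'er', 'l', 'd']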
4. The method of claim 3, wherein decoding the phoneme label according to the HCLG to obtain a phoneme class comprises:
integrating the phoneme labels according to the HCLG to obtain phoneme information, wherein the phoneme information comprises a plurality of phoneme labels;
determining the phoneme class of the phoneme information according to the class of each phoneme label in the preset pronunciation dictionary;
matching the phoneme information with each speech frame according to the HCLG;
and determining the phoneme class of each speech frame according to the matching result.
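Claims 3 and 4 then reduce the decoded phoneme labels to a per-frame class by checking, against the preset pronunciation dictionary, whether each phoneme belongs to a silence/noise group or to a genuine speech phone. A minimal sketch, in which the phone inventory and the 0/1 class convention are illustrative assumptions:

    # hypothetical split of the phone set taken from the pronunciation dictionary
    NON_SPEECH_PHONES = {"sil", "sp", "spn", "nsn"}  # silence and spoken/non-spoken noise phones

    def second_class_label(phoneme_label):
        # 1 = non-speech (noise) frame, 0 = speech frame, mirroring the first model's labels
        return 1 if phoneme_label in NON_SPEECH_PHONES else 0

    second_labels = [second_class_label(p) for p in ["sil", "hh", "ah", "sil"]]
    # -> [1, 0, 0, 1]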
5. The method of claim 1, wherein fusing the first class label and the second class label to obtain a target label for each speech frame comprises:
and performing a bitwise OR operation on the first class label and the second class label to obtain the target label of each speech frame.
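A tiny numeric illustration of the bitwise OR fusion in claim 5, assuming the value 1 marks a noise frame and 0 a speech frame: a frame ends up labelled as noise in the target label whenever either model flags it.

    import numpy as np

    first_labels = np.array([0, 1, 0, 1])
    second_labels = np.array([0, 0, 1, 1])
    target_labels = first_labels | second_labels  # -> array([0, 1, 1, 1])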
6. The method of claim 1, wherein determining a noise segment according to the target label comprises:
and determining, as a noise segment, a segment formed by consecutive speech frames whose target labels are equal to a first set value.
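Claim 6 then groups consecutive frames whose target label equals the first set value (taken to be 1 here) into noise segments; a minimal sketch, with the helper name and the example labels chosen only for illustration:

    def find_noise_segments(target_labels, first_set_value=1):
        # return (start_frame, end_frame) pairs covering maximal runs of the set value
        segments, start = [], None
        for i, label in enumerate(target_labels):
            if label == first_set_value and start is None:
                start = i
            elif label != first_set_value and start is not None:
                segments.append((start, i))
                start = None
        if start is not None:
            segments.append((start, len(target_labels)))
        return segments

    print(find_noise_segments([0, 1, 1, 0, 1]))  # -> [(1, 3), (4, 5)]

The frame indices would be converted back to time (and thus to the audio span to extract) using the frame and hop lengths chosen during framing.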
7. The method according to any of claims 1-6, wherein the training process of the first speech recognition model comprises:
acquiring acoustic features of each speech frame in sample speech data and a first class label of each speech frame;
and training the first speech recognition model based on a first training data pair formed by the acoustic features and the first class labels.
8. The method of claim 7, wherein the second speech recognition model is trained by:
analyzing sample text data to obtain phoneme information of each speech frame, wherein the sample text data is data corresponding to the sample speech data;
determining a phoneme label corresponding to the phoneme information;
training the second speech recognition model based on a second training data pair consisting of the acoustic features and the phoneme labels.
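For claims 7 and 8, both models are trained from frame-aligned pairs: (acoustic features, first class label) for the first model, and (acoustic features, phoneme label) for the second, the latter derived by analysing the sample text data. A minimal sketch of assembling such pairs; the toy feature values and labels are invented for the example and stand in for real aligned training data:

    import numpy as np

    def build_training_pairs(features, frame_labels):
        # features: (num_frames, feature_dim); frame_labels: one label per frame
        assert len(features) == len(frame_labels)
        return list(zip(features, frame_labels))

    # toy stand-ins: three frames of 2-dimensional features with frame-level labels
    sample_features = np.array([[0.1, 0.2], [0.3, 0.1], [0.0, 0.0]])
    first_class_labels = [0, 0, 1]          # speech / speech / noise, for the first model
    phoneme_labels = ["hh", "ah", "sil"]    # from the aligned sample text, for the second model

    first_pairs = build_training_pairs(sample_features, first_class_labels)
    second_pairs = build_training_pairs(sample_features, phoneme_labels)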
9. A noise extraction device, characterized by comprising:
an acoustic feature acquisition module, configured to acquire acoustic features of each speech frame in speech data;
a first class label acquisition module, configured to input the acoustic features into a first speech recognition model to obtain a first class label of each speech frame;
a phoneme label obtaining module, configured to input the acoustic features into a second speech recognition model, so as to obtain phoneme information of each speech frame;
a second class label determination module, configured to determine a second class label of each speech frame according to the phoneme information;
a label fusion module, configured to fuse the first class label and the second class label to obtain a target label of each speech frame;
and a noise segment extraction module, configured to determine a noise segment according to the target label and extract the noise segment.
10. A computer device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the noise extraction method of any one of claims 1-8.
11. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the noise extraction method according to any one of claims 1 to 8.
CN202011131906.9A 2020-10-21 2020-10-21 Noise extraction method, device, equipment and storage medium Active CN112420022B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011131906.9A CN112420022B (en) 2020-10-21 2020-10-21 Noise extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011131906.9A CN112420022B (en) 2020-10-21 2020-10-21 Noise extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112420022A true CN112420022A (en) 2021-02-26
CN112420022B CN112420022B (en) 2024-05-10

Family

ID=74841606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011131906.9A Active CN112420022B (en) 2020-10-21 2020-10-21 Noise extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112420022B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101548313A (en) * 2006-11-16 2009-09-30 国际商业机器公司 Voice activity detection system and method
JP2011191542A (en) * 2010-03-15 2011-09-29 Nec Corp Voice classification device, voice classification method, and program for voice classification
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device
US20180315431A1 (en) * 2017-04-28 2018-11-01 Cisco Technology, Inc. Audio frame labeling to achieve unequal error protection for audio frames of unequal importance
CN109036471A (en) * 2018-08-20 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method and equipment
CN110738986A (en) * 2019-10-24 2020-01-31 数据堂(北京)智能科技有限公司 long voice labeling device and method
CN111243599A (en) * 2020-01-13 2020-06-05 网易有道信息技术(北京)有限公司 Speech recognition model construction method, device, medium and electronic equipment

Also Published As

Publication number Publication date
CN112420022B (en) 2024-05-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant