CN112420022B - Noise extraction method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112420022B
CN112420022B
Authority
CN
China
Prior art keywords
phoneme
voice
label
class
labels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011131906.9A
Other languages
Chinese (zh)
Other versions
CN112420022A (en)
Inventor
叶帅帅
胡新辉
徐欣康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Original Assignee
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority to CN202011131906.9A
Publication of CN112420022A
Application granted
Publication of CN112420022B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a noise extraction method, a device, equipment and a storage medium. The method comprises the following steps: acquiring acoustic characteristics of each voice frame in voice data; inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame; inputting the acoustic features into a second voice recognition model to obtain phoneme labels of each voice frame; determining a second class label of each voice frame according to the phoneme label; fusing the first class labels and the second class labels to obtain target labels of all voice frames; and determining a noise section according to the target tag, and extracting the noise section. According to the method, the recognition results of the two neural networks are fused to obtain the noise in the voice data, so that the accuracy of noise extraction can be improved.

Description

Noise extraction method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the field of speech technology, and in particular to a noise extraction method, device, equipment and storage medium.
Background
Speech technology, as an important branch of artificial intelligence (AI), plays a very important role in human-computer interaction. In order to improve the noise immunity and robustness of speech technologies such as speech recognition and voiceprint recognition in actual speech application systems, augmenting the training corpus with noise corpora is the most important and common technical means.
In actual use, most speech technologies rely on open-source noise data sets, which do not match the noise of the actual usage scene well, so the performance of speech technologies such as speech recognition and voiceprint recognition is often unsatisfactory. To further improve the performance of speech technology in actual scenes, the key is to perform data augmentation with environmental noise data from the actual application scene and improve the degree of matching between the training data and the test environment.
In the prior art, noise extraction is performed based on the result of conventional Voice Activity Detection (VAD), but this method often misjudges low-energy speech and high-energy noise, so that the extracted noise also contains speech fragments.
Therefore, how to effectively extract the noise in the actual environmental voice is a technical problem to be solved currently.
Disclosure of Invention
The embodiment of the invention provides a noise extraction method, a device, equipment and a storage medium.
In a first aspect, an embodiment of the present invention provides a noise extraction method, including:
acquiring acoustic characteristics of each voice frame in voice data;
inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame;
inputting the acoustic features into a second voice recognition model to obtain phoneme labels of each voice frame;
determining a second class label of each voice frame according to the phoneme label;
fusing the first class labels and the second class labels to obtain target labels of all voice frames;
And determining a noise section according to the target tag, and extracting the noise section.
In a second aspect, an embodiment of the present invention further provides a noise extraction apparatus, including:
the acoustic feature acquisition module is used for acquiring acoustic features of each voice frame in the voice data;
The first class label acquisition module is used for inputting the acoustic characteristics into a first voice recognition model to acquire first class labels of each voice frame;
The phoneme label obtaining module is used for inputting the acoustic characteristics into a second voice recognition model to obtain phoneme labels of the voice frames;
the second class label determining module is used for determining a second class label of each voice frame according to the phoneme label;
the label fusion module is used for fusing the first class labels and the second class labels to obtain target labels of all voice frames;
and the noise segment extraction module is used for determining a noise segment according to the target tag and extracting the noise segment.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, including:
One or more processors;
A storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the noise extraction method described in any of the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a noise extraction method as provided by any of the embodiments of the present invention.
The embodiment of the invention provides a noise extraction method, a device, equipment and a storage medium, wherein the method, the device, the equipment and the storage medium firstly acquire acoustic characteristics of each voice frame in voice data; secondly, inputting the acoustic features into a first voice recognition model to obtain first class labels of each voice frame; inputting the acoustic features into a second voice recognition model to obtain phoneme labels of each voice frame; then determining a second class label of each voice frame according to the phoneme label; fusing the first class labels and the second class labels to obtain target labels of all voice frames; and finally, determining a noise section according to the target tag, and extracting the noise section. According to the noise extraction method disclosed by the embodiment, the recognition results of the two neural networks are fused to obtain the noise in the voice data, so that the accuracy of noise extraction can be improved.
Drawings
Fig. 1 is a flow chart of a noise extraction method according to a first embodiment of the present invention;
fig. 2 is a flow chart of a noise extraction method according to a second embodiment of the present invention;
fig. 3 is an overall flowchart of a noise extraction method according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a noise extraction device according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
The term "comprising" and variants thereof as used herein is intended to be open ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment".
Example 1
Fig. 1 is a flow chart of a noise extraction method according to a first embodiment of the present invention, where the method may be applied to a case of extracting a noise segment from a speech of an actual environment, and the method may be performed by a noise extraction device, where the device may be implemented by software and/or hardware and is generally integrated on a computer device.
As shown in fig. 1, a noise extraction method provided in a first embodiment of the present invention includes the following steps:
s110, acquiring acoustic characteristics of each voice frame in voice data.
The acoustic feature may be the Mel-scale Frequency Cepstral Coefficient (MFCC). In this embodiment, the MFCC, a speech feature parameter, is the most commonly used speech feature in speech recognition. Alternatively, perceptual linear prediction (PLP) or other features may also be used as speech features.
Illustratively, taking the MFCC as the acoustic feature, a computer device may frame the speech data and then extract the MFCC of each speech frame. The most common and basic MFCC extraction steps in speech recognition can be summarized as follows: first frame the speech signal; then apply a Fourier transform to each frame to obtain its spectrum; compute the energy of the spectrum under each triangular filter of the Mel filter bank; take the logarithm of the result; and finally obtain the cepstrum through a further transform (in practice a discrete cosine transform, as described in the second embodiment below).
The above process yields a compact representation of each speech frame as a 12-20 dimensional vector, and a whole segment of speech data can then be represented as a sequence of such vectors; speech recognition models these vectors and vector sequences.
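As a concrete illustration of the above steps, the following is a minimal sketch of per-frame MFCC extraction. It assumes the librosa toolkit, 16 kHz audio and a 13-dimensional feature; the patent itself does not prescribe a specific toolkit or parameter set.

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    # Load the audio; 16 kHz mono is an assumed, commonly used configuration.
    y, sr = librosa.load(wav_path, sr=16000)
    # 25 ms analysis window with a 10 ms frame shift (see also step S210 below).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    # Transpose so that each row is the feature vector of one speech frame.
    return mfcc.T  # shape: (num_frames, n_mfcc)
```

Each row of the returned matrix is the feature vector of one speech frame, matching the compact per-frame representation described above.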
S120, inputting the acoustic features into a first voice recognition model to obtain first class labels of each voice frame.
In this embodiment, the first speech recognition model may be a network model based on Speech Activity Detection (SAD). Each speech frame can be classified as speech or non-speech by the SAD network, which recognizes, based on a deep neural network, whether each speech frame is a noise frame or a normal frame. The first speech recognition model is trained on a large amount of sample speech data.
The first class label may be the number 0 or 1, where 0 represents a noise frame and 1 represents a normal frame. The acquired acoustic features are input into the trained SAD network model to obtain the first class label corresponding to each speech frame.
For example, if an acoustic feature corresponding to a certain speech frame is input into the SAD model and the output first class label is 1, the speech frame corresponding to the acoustic feature is a normal speech frame; if the output first class label is 0, the speech frame is a noise frame.
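A hedged sketch of this step is given below. The `sad_model` object and its `predict` method are illustrative assumptions standing in for the trained SAD network, which the patent does not specify in code form.

```python
import numpy as np

def first_class_labels(frame_features, sad_model, threshold=0.5):
    """Label each frame: 1 = normal speech frame, 0 = noise frame."""
    # Assumed interface: the model returns P(speech) for every frame.
    speech_probs = np.asarray(sad_model.predict(frame_features))
    return (speech_probs >= threshold).astype(int)
```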
S130, inputting the acoustic features into a second voice recognition model to obtain the phoneme labels of the voice frames.
In this embodiment, the second speech recognition model may be an acoustic model for automatic speech recognition (ASR). The ASR model can recognize the phoneme label corresponding to each speech frame based on a deep neural network, and the second speech recognition model is obtained by training on a large amount of sample speech data.
The phonetic unit corresponding to the phoneme label (hereinafter simply referred to as a phoneme) may be understood as the minimum unit in the acoustic feature. The phoneme label may be represented by a natural number, with each value corresponding to one phoneme.
For example, if the phoneme information includes phonemes "m" and "i", the phoneme label of the phoneme "m" may be 20 and the phoneme label of the phoneme "i" may be 32.
The phoneme information may be understood as information describing the content of phonemes, which are the minimum phonetic units divided according to the natural attributes of speech; they can be analyzed according to the pronunciation actions within a syllable, with one action constituting one phoneme. Illustratively, 'ma' contains the pronunciation actions 'm' and 'a', which are two phonemes. Sounds made by the same pronunciation action are the same phoneme, and sounds made by different pronunciation actions are different phonemes. In 'ma-mi', the two 'm' pronunciation actions are the same and thus the same phoneme, while 'a' and 'i' are different pronunciation actions and thus different phonemes. The analysis of phonemes is generally described in terms of pronunciation actions.
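The phoneme-to-label correspondence can be pictured with a toy table. The numbers below simply reuse the "m" -> 20, "i" -> 32 example given above and are not the patent's actual dictionary.

```python
# Toy phoneme-label table for illustration only.
PHONEME_LABELS = {"m": 20, "i": 32, "a": 7}

def phonemes_to_labels(phonemes):
    """Map a phoneme sequence such as ["m", "i"] to its numeric labels [20, 32]."""
    return [PHONEME_LABELS[p] for p in phonemes]
```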
In this embodiment, after the acoustic features are input into the ASR acoustic model for recognition, the phoneme label of each speech frame is output.
S140, determining a second class label of each voice frame according to the phoneme label.
The second class tag is a tag characterizing whether a speech frame in the speech data is a noise frame; for example, if the current speech frame is a normal speech frame, it is marked with 1, and if it is a noise frame, it is marked with 0.
Specifically, the manner of determining the second class label of each speech frame according to the phoneme label may be: constructing a voice recognition decoding diagram HCLG according to the text data and a preset pronunciation dictionary; decoding the phoneme label according to HCLG to obtain a phoneme category; and determining a second class label of each voice frame according to the phoneme class.
The text data is data corresponding to voice data, for example, for a song, the lyrics inside are text data thereof. The phoneme categories may include silence phonemes, garbage phonemes, noise phonemes, normal phonemes, and the like.
The preset pronunciation dictionary is built based on a large number of voice samples and comprises a corresponding relation between phonemes and phoneme labels. According to the preset pronunciation dictionary, each phoneme can be accurately corresponding to the phoneme label. In this embodiment, the construction of the speech recognition decoding diagram HCLG according to the text data and the preset pronunciation dictionary may be implemented based on the existing technology, which is not described herein.
In this embodiment, the second class label of each speech frame may be determined from the phoneme category as follows: if the phoneme category is a mute phoneme or a garbage phoneme, the second class label of the current speech frame is set to 0; if the phoneme category is a normal phoneme, the second class label of the current speech frame is set to 1.
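The rule just described can be sketched as a small mapping function. The category names are assumptions for illustration; the patent expresses them only as mute/silence, garbage, noise and normal phonemes.

```python
NON_SPEECH_CATEGORIES = {"silence", "garbage", "noise"}

def second_class_label(phoneme_category):
    """Return 0 for mute/garbage/noise phoneme categories and 1 for normal phonemes."""
    return 0 if phoneme_category in NON_SPEECH_CATEGORIES else 1
```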
Optionally, the method for decoding the phoneme label according to HCLG to obtain the phoneme category may be: integrating the phoneme labels according to HCLG to obtain phoneme information; determining the phoneme category of the phoneme information according to the category of each phoneme label in a preset pronunciation dictionary; matching the phoneme information with each speech frame according to HCLG; and determining the phoneme category of each voice frame according to the matching result.
The phoneme information comprises a plurality of phoneme labels. The class of a phoneme label may be understood as the class with which that label is marked in the preset pronunciation dictionary, for example: phoneme label 20 marked as noise, phoneme label 99 as mute, phoneme label 301 as garbage, and so on; this is merely an example and is not limiting.
Specifically, the integration of the phoneme label according to HCLG and the matching of the phoneme information with each speech frame according to HCLG may be implemented by using the existing technology, which is not described herein again.
Specifically, the phoneme category of the phoneme information may be determined from the class of each phoneme label in the preset pronunciation dictionary as follows: if at least one of the phoneme labels included in the phoneme information is an abnormal phoneme, the phoneme category of the phoneme information is determined to be abnormal. For example, if phoneme label 20 is included in a certain piece of phoneme information, the phoneme category of that phoneme information is noise.
Specifically, after the phoneme information is matched with the speech frames, the phoneme category of each speech frame can be obtained.
And S150, fusing the first class labels and the second class labels to obtain target labels of all the voice frames.
In this embodiment, a frame-by-frame bit-or (bitwise OR) operation is performed on the first class label and the second class label to obtain the target label of each speech frame. For example, if the first class label of a speech frame is 1 and its second class label is 0, the target label of that speech frame after the bitwise OR operation is 1.
The rule of the bitwise OR operation is that if either the first class label or the second class label is 1, the target label is determined to be 1; only when the first class label and the second class label are both 0 is the target label determined to be 0.
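A minimal sketch of the frame-by-frame fusion follows; it assumes the two label sequences are equal-length arrays of 0/1 values.

```python
import numpy as np

def fuse_labels(first_class_labels, second_class_labels):
    """Bitwise OR per frame: the target label is 0 only when both labels are 0."""
    first = np.asarray(first_class_labels, dtype=int)
    second = np.asarray(second_class_labels, dtype=int)
    return first | second
```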
S160, determining a noise section according to the target tag, and extracting the noise section.
In this embodiment, the noise segment may be speech data formed by combining a plurality of noise frames. Specifically, the noise segment may be determined from the target label as follows: a speech segment formed by consecutive speech frames whose target label is 0, where the number of consecutive frames reaches a first set value, is determined to be a noise segment.
The first set value may be any value of 2 or more. For example, if the target labels of a plurality of consecutive speech frames are all 0, the speech segment formed by combining those consecutive speech frames is determined to be a noise segment.
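The run-collecting logic can be sketched as follows; `min_frames` stands in for the first set value (assumed here to be a minimum run length of at least 2 frames, per the description above).

```python
def noise_segments(target_labels, min_frames=2):
    """Return (start, end) frame index pairs of runs where the target label stays 0."""
    segments, start = [], None
    for i, label in enumerate(target_labels):
        if label == 0 and start is None:
            start = i                               # a run of noise frames begins
        elif label != 0 and start is not None:
            if i - start >= min_frames:
                segments.append((start, i))         # frames [start, i) are noise
            start = None
    if start is not None and len(target_labels) - start >= min_frames:
        segments.append((start, len(target_labels)))
    return segments
```

For example, noise_segments([1, 0, 0, 0, 1]) returns [(1, 4)], i.e. frames 1-3 form a noise segment.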
The first embodiment of the invention provides a noise extraction method, which comprises the steps of firstly obtaining the acoustic features of each voice frame in the voice data; secondly, inputting the acoustic features into a first voice recognition model to obtain the first class label of each voice frame; inputting the acoustic features into a second voice recognition model to obtain the phoneme label of each voice frame; then determining a second class label of each voice frame according to the phoneme label; fusing the first class labels and the second class labels to obtain the target label of each voice frame; and finally, determining a noise section according to the target label, and extracting the noise section. According to the method, the recognition results of the two neural networks are fused to obtain the noise in the voice data, so that the accuracy of noise extraction can be improved.
Example two
Fig. 2 is a schematic flow chart of a noise extraction method according to a second embodiment of the present invention. The second embodiment refines the noise extraction method of the above embodiment. Specifically, acquiring the acoustic features of each speech frame in the speech data includes: framing the speech data to obtain a plurality of speech frames; and extracting the acoustic features of the plurality of speech frames.
Further, determining a second class label of each voice frame according to the phoneme label specifically includes: constructing a voice recognition decoding diagram HCLG according to the text data and a preset pronunciation dictionary; the text data is data corresponding to the voice data; decoding the phoneme label according to HCLG to obtain a phoneme category; and determining a second class label of each voice frame according to the phoneme class.
Further, fusing the first class tag and the second class tag to obtain a target tag of each voice frame, including: and carrying out bit or operation on the first class label and the second class label to obtain target labels of all voice frames.
Further, determining a noise segment according to the target tag includes: and determining a voice segment formed by voice frames with the target tag being continuously set to a first set value as a noise segment.
Accordingly, as shown in fig. 2, a noise extraction method provided in the second embodiment of the present invention includes the following steps:
s210, carrying out framing processing on the voice data to obtain a plurality of voice frames.
The speech data may be speech data collected from the actual environment. Framing the speech data allows the characteristic parameters of the speech signal to be analyzed segment by segment; each segment is called a "frame", and the frame length is generally 10-30 ms, i.e. the length of a frame should be smaller than the length of a phoneme.
In this embodiment, the frame length may optionally be 25 ms with a time interval of 10 ms, that is, the speech data is framed every 10 ms. After signal processing, the audio is split into frames, and the resulting short waveform segments are converted into multidimensional vectors according to the characteristics of human hearing.
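A framing sketch with the 25 ms / 10 ms configuration mentioned above is given below; it assumes a 16 kHz mono signal held in a NumPy array that is at least one frame long.

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_len=0.025, hop=0.010):
    frame_size = int(frame_len * sr)   # 400 samples per 25 ms frame at 16 kHz
    hop_size = int(hop * sr)           # 160 samples per 10 ms shift at 16 kHz
    num_frames = 1 + (len(signal) - frame_size) // hop_size
    return np.stack([signal[i * hop_size: i * hop_size + frame_size]
                     for i in range(num_frames)])   # shape: (num_frames, frame_size)
```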
S220, extracting acoustic features of the plurality of voice frames.
The voice frames and the acoustic features are in one-to-one correspondence, namely one voice frame corresponds to one acoustic feature, and the acoustic features of the voice frames in all voice data can be extracted through the mode.
Taking the MFCC as an example, extracting the MFCC of a speech frame may include: pre-emphasizing the speech data by passing the speech signal through a high-pass filter; framing and windowing the speech data; performing a fast Fourier transform; applying the Mel filter bank; performing a discrete cosine transform; and finally extracting the dynamic difference parameters.
S230, inputting the acoustic features into a first voice recognition model to obtain first class labels of each voice frame.
S240, inputting the acoustic features into a second voice recognition model to obtain the phoneme labels of the voice frames.
S250, constructing a voice recognition decoding diagram HCLG according to the text data and a preset pronunciation dictionary.
The text data is data corresponding to voice data.
In this embodiment, a speech recognition decoding diagram HCLG needs to be constructed in the speech recognition process.
The preset pronunciation dictionary is constructed by text data and comprises a plurality of phoneme labels and categories of the phoneme labels.
S260, decoding the phoneme label according to HCLG to obtain a phoneme category.
Specifically, the method for decoding the phoneme label according to HCLG to obtain the phoneme category may be: integrating the phoneme labels according to HCLG to obtain phoneme information; the phoneme information includes a plurality of phoneme labels; determining the phoneme category of the phoneme information according to the category of each phoneme label in a preset pronunciation dictionary; matching the phoneme information with each speech frame according to HCLG; and determining the phoneme category of each voice frame according to the matching result.
One piece of phoneme information may include a plurality of phoneme labels. Taking the phoneme information "wo" as an example, the phonemes it includes are "w" and "o"; if the phoneme label corresponding to "w" is "110" and the phoneme label corresponding to "o" is "099", then the phoneme label sequence corresponding to "wo" may be "110099".
Integrating the phoneme labels according to the HCLG to obtain the phoneme information may be understood as the HCLG integrating the plurality of phoneme labels corresponding to the phoneme information, and the integration may be done in various ways. Illustratively, mode one: inserting a new phoneme label among the phoneme labels corresponding to the phoneme information; mode two: correcting a wrong phoneme label among the phoneme labels corresponding to the phoneme information to obtain the correct phoneme label. By integrating the labels in this way, the correct phoneme labels corresponding to the phoneme information can be obtained. The first and second modes may be performed simultaneously.
The class of a phoneme label can be obtained by looking it up in the preset pronunciation dictionary; the classes may include mute phonemes, garbage phonemes, normal phonemes, and the like. The phoneme category classifies the phoneme information into different categories according to the class of each phoneme label it contains, and may likewise include mute, garbage and normal phonemes. Illustratively, if the class of a phoneme label of the phoneme information is a mute phoneme, represented for example by the numeral 1, the phoneme category of that phoneme information can be determined to be mute by looking up the preset pronunciation dictionary.
The phoneme information is matched with each voice frame according to HCLG, the matching principle can be that one phoneme information can correspond to a plurality of voice frames, and the number of voice frames corresponding to one phoneme information can be determined by HCLG.
The matching in step S260 may be performed to correspond the phoneme information to the speech frames, and after determining the phoneme type of the phoneme information, the phoneme type of each speech frame may be determined according to the matching result.
S270, determining a second class label of each voice frame according to the phoneme class.
The phoneme category may include silence phonemes, garbage phonemes, noise phonemes, and normal phonemes. The values of the second class labels may include 0 and 1.
For example, if the phoneme class is mute, garbage, or noise, the second class label is 0, and if the phoneme class is normal, the second class label is 1.
S280, carrying out bit or operation on the first class label and the second class label to obtain target labels of all voice frames.
Specifically, performing the bitwise OR operation on the first class label and the second class label covers the following cases: if the first class label and the second class label of the current speech frame are both 0, the target label of that speech frame after the bitwise OR operation is 0; if the first class label and the second class label are both 1, the target label is 1; and if one of the first class label and the second class label is 0 and the other is 1, the target label is 1.
S290, determining a voice segment formed by voice frames with the target label being continuously set to a first set value as a noise segment.
The first set value may be a specific value set in advance; illustratively, it may be any value of 2 or more, meaning that a speech segment formed by two or more consecutive speech frames whose target label is 0 is determined to be a noise segment.
The second embodiment of the present invention provides a noise extraction method, which details the process of acquiring the acoustic features of each speech frame in the speech data and determining the second class label of each speech frame according to the phoneme label, and further details the process of fusing the first class label and the second class label to obtain the target label of each speech frame and determining the noise segment according to the target label. In this method, the SAD model can judge noise frames and normal speech frames more effectively, the ASR acoustic model can effectively judge noise frames according to the phoneme category, and fusing the two allows noise segments in the actual speech environment to be identified accurately.
Further, the training process of the first speech recognition model is as follows: acquiring acoustic characteristics of each voice frame in sample voice data and a first class label of each voice frame; and training a first voice recognition model based on a first training data pair formed by the acoustic features and the first class labels.
The sample speech data is acquired from a data set and serves as the training set of the SAD network model. The first class label is used to characterize whether a speech frame is a normal frame or a noise frame.
Specifically, firstly, framing processing is carried out on sample voice data, acoustic features of voice frames are extracted, and the acoustic features are used as input of an SAD network model. And constructing the first class labels corresponding to the acoustic features into a first training data pair, and training the SAD network model through the constructed first training data pair.
Training the first speech recognition model may be to label the speech frames according to the correspondence in the first training data pair, wherein the SAD network model resembles a classifier. The training process is a process of continuously optimizing the SAD network model. Illustratively, noise frames in the sample speech data may be labeled 0 and speech frames labeled 1.
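As a hedged illustration of the first training data pair and the classifier-like behaviour described above, the following sketch trains a small frame-level network with PyTorch; the layer sizes, optimizer and 13-dimensional MFCC input are assumptions, since the patent does not fix the SAD network structure.

```python
import torch
import torch.nn as nn

# Illustrative frame classifier: 13-dim MFCC in, two classes out (0 = noise, 1 = speech).
sad_model = nn.Sequential(nn.Linear(13, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(sad_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def sad_train_step(mfcc_batch, first_class_label_batch):
    """One step over a batch of first training data pairs: (frame features, 0/1 labels)."""
    logits = sad_model(mfcc_batch)                   # mfcc_batch: (N, 13) float tensor
    loss = loss_fn(logits, first_class_label_batch)  # labels: (N,) long tensor of 0/1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```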
Further, the training process of the second speech recognition model is as follows: analyzing the sample text data to obtain the phoneme information of each voice frame; the sample text data is data corresponding to the sample voice data; determining a phoneme label corresponding to the phoneme information; and training the second voice recognition model based on a second training data pair formed by the acoustic features and the phoneme labels.
The sample text data may be understood as text data corresponding to sample speech data, for example, the sample text data may be text data corresponding to sample speech data that is input into the SAD network model for training.
The second training data pair may be understood as a pair of data formed by combining the acoustic features of the sample voice data and the phoneme labels of the sample voice data.
Specifically, the acquired sample speech data is first framed to obtain a plurality of speech frames; the sample text data is then parsed into phoneme information and the corresponding phoneme labels are determined; next, the acoustic feature of a speech frame and the phoneme label of that frame are constructed into a second training data pair; finally, the ASR acoustic model is trained and continuously optimized, so that the input acoustic features and the output phoneme labels have an accurate correspondence.
The acoustic features are inputs of an ASR acoustic model, and outputs of the ASR model are phoneme labels. Wherein the sample speech data may be training set data of an ASR acoustic model.
Parsing the sample text data into phoneme information may be completed through a phoneme information table, which may include all phoneme information; the content of the text data is split into words, and each word is then mapped to its phoneme information. Illustratively, according to the phoneme information table, the word "good" corresponds to the phoneme information "hao", and the phonemes included in this phoneme information are "h", "a" and "o".
The determining the phoneme label corresponding to the phoneme information may correspond the phoneme information to the phoneme label according to a preset pronunciation dictionary. The preset pronunciation dictionary may match corresponding phoneme labels for the pronunciation of the phoneme information, and different phoneme information may correspond to different labels.
The second training data pair formed by the acoustic features and the phoneme labels can be understood as combining the acoustic features of the voice frame and the phoneme labels corresponding to the phoneme information into one data pair, and the data pair can reflect the corresponding relation between the acoustic features and the phoneme labels.
The acoustic feature of a speech frame is input into the ASR acoustic model, the output result may be the phoneme label corresponding to the input acoustic feature, and the output phoneme label is then verified against the phoneme label in the second training data pair.
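Following the "good" -> "h a o" example above, the construction of second training data pairs can be sketched with toy dictionaries; both tables and the frame-level alignment are illustrative assumptions, not the patent's actual pronunciation dictionary.

```python
# Toy tables for illustration only.
PHONEME_TABLE = {"good": ["h", "a", "o"]}        # word -> phoneme information
PHONEME_LABELS = {"h": 11, "a": 7, "o": 99}      # phoneme -> numeric phoneme label

def text_to_phoneme_labels(words):
    """Parse sample text into phoneme information, then map it to phoneme labels."""
    labels = []
    for word in words:
        labels.extend(PHONEME_LABELS[p] for p in PHONEME_TABLE[word])
    return labels

def build_second_training_pairs(frame_features, frame_phonemes):
    """Pair each frame's acoustic feature with the label of its aligned phoneme."""
    labels = [PHONEME_LABELS[p] for p in frame_phonemes]
    return list(zip(frame_features, labels))
```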
Example III
Fig. 3 is an overall flowchart of a noise extraction method according to a third embodiment of the present invention, and as can be seen from fig. 3, the noise extraction method according to the third embodiment of the present invention is a noise extraction method based on SAD and ASR.
The implementation of this embodiment is mainly divided into two parts. The first part is to train the SAD network model and the ASR acoustic model: the SAD model is similar to a classifier and can directly label the input speech frames, i.e. produce the first class labels; the ASR acoustic model can decode the input speech frames into phonemes, i.e. phoneme information, and the speech frames are then labeled by judging whether the phonemes are valid phonemes, i.e. producing the second class labels. The second part is to extract noise through the SAD network model and ASR acoustic model trained in the first part, fusing the output of the SAD network model and the output of the ASR acoustic model with a multi-model fusion mechanism.
As shown in fig. 3, the overall noise extraction process may be divided into two major phases, including a model training phase and a noise extraction phase, where the model training phase may include SAD model training and ASR acoustic model training. The noise extraction stage may comprise the steps of:
After model training, test speech can be input into the SAD model and the ASR acoustic model respectively for testing to obtain the test results: the result output by the SAD model is the first class label, and the test result output by the ASR acoustic model is the second class label. A bitwise OR operation is then performed on the first class label and the second class label to obtain the target label, the corresponding noise frames are extracted according to the target label, and the speech segment corresponding to those noise frames is the noise segment.
By way of example, the SAD model training phase may include: the SAD training set is input into the SAD model for training, wherein the training data of the SAD training set can comprise a voice file, namely sample voice data, and a label of a corresponding frame level, namely a first class label.
Specifically, the SAD model training phase may include the steps of:
Step 1: frame the sample speech data and extract the acoustic features of each speech frame as the input of the SAD model.
Step 2: construct the training data pairs, i.e. the first training data pairs, from the MFCC features and the speech frame labels, i.e. the first class labels.
Step 3: train the SAD model to obtain its final network parameters, i.e. train the first speech recognition model.
The final network parameter may be a parameter of a SAD-CNN, SAD-DNN or SAD-PDN network model. It should be noted that, after the SAD model is trained, model optimization is continuously performed to obtain the final optimal network parameters, and the network parameters can be used to test the input voice data in the SAD model test stage.
Illustratively, the ASR acoustic model training phase may include: and inputting the training set of the ASR acoustic model into the ASR acoustic model for training, wherein the training data of the ASR acoustic model training set can comprise a voice file, namely sample voice data, and a text file corresponding to the voice file, namely sample text data.
Specifically, the ASR acoustic model training phase may include the steps of:
Step 1: frame the sample speech data and extract the MFCC of each speech frame as the input of the ASR acoustic model.
Step 2: parse the sample text file into phonemes and map them to numeric phoneme labels according to the pronunciation dictionary, i.e. determine the phoneme label corresponding to the phoneme information.
Step 3: construct the training data pairs from the MFCC acoustic features and the phoneme labels of the speech frames, i.e. the second training data pairs formed by the acoustic features and the phoneme labels.
Step 4: train the ASR acoustic model to obtain its final network parameters, i.e. train the second speech recognition model.
The final network parameters may be parameters of an ASR-CNN, ASR-DNN or ASR-PDN network model. After the ASR acoustic model is trained, model optimization is continuously carried out to obtain final optimal network parameters, and the network parameters can be used for testing input voice data in the ASR acoustic model testing stage.
The noise extraction phase may comprise, for example, the following steps:
Step 1: frame the speech data and extract the MFCC acoustic features of each speech frame. Here the speech data may be understood as test speech data.
Step 2: input the MFCC acoustic features into the trained SAD model and output the label of each speech frame, i.e. the first class label, where 0 may represent a noise frame and 1 a speech frame.
Step 3: construct the speech recognition decoding diagram HCLG using the text data corresponding to the speech data.
Step 4: input the extracted MFCC acoustic features into the trained ASR acoustic model, and decode and output the phoneme category of each speech frame according to the obtained HCLG.
Step 5: judge the class of each speech frame according to the decoded phoneme category; if it is a mute phoneme or a garbage phoneme, mark the speech frame as 0, otherwise mark it as 1, where, as before, 0 represents a noise frame and 1 a speech frame. That is, the phoneme labels are decoded according to the HCLG to obtain the phoneme categories, and the second class label of each speech frame is determined according to the phoneme category.
Step 6: perform a frame-by-frame bitwise OR operation on the output results obtained in Step 2 and Step 5 to obtain the final speech frame labels, and extract the speech frames that are consecutively 0 to obtain the noise segments. That is, a bitwise OR operation is performed on the first class label and the second class label to obtain the target label of each speech frame, and a speech segment formed by speech frames whose target label remains 0 for at least the first set value of consecutive frames is determined to be a noise segment.
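Pulling the pieces together, the noise extraction stage of steps 1-6 can be sketched as one function. `asr_decode_to_categories` is an assumed helper standing in for the HCLG decoding of steps 3-4, and the remaining helpers are the illustrative sketches given earlier in this document.

```python
def extract_noise(wav_path, sad_model, asr_decode_to_categories, min_frames=2):
    mfcc = extract_mfcc(wav_path)                          # step 1: frame-level features
    first = first_class_labels(mfcc, sad_model)            # step 2: SAD labels (0/1)
    categories = asr_decode_to_categories(mfcc)            # steps 3-4: phoneme category per frame
    second = [second_class_label(c) for c in categories]   # step 5: second class labels
    target = fuse_labels(first, second)                    # step 6: frame-by-frame bitwise OR
    return noise_segments(target, min_frames=min_frames)   # runs of consecutive 0 frames
```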
In order to further improve the performance of speech technology in actual scenes, model training should not rely only on open-source noise data sets for data augmentation; more importantly, data augmentation should use the environmental noise data of the actual application scene to improve the degree of matching between the training data and the test environment. To extract noise segments from speech data in real environments, noise extraction has previously been performed based on the results of conventional Voice Activity Detection (VAD). However, this method often misjudges low-energy speech and high-energy noise, so the extracted noise also contains speech fragments. To solve this problem, this embodiment uses a multi-model fusion mechanism that fuses the SAD model and the ASR acoustic model to extract the noise of the actual environment from the test data set, improving the accuracy of noise extraction.
In the noise extraction method provided by this embodiment, in the first aspect, the SAD technique is applied to the noise extraction field; compared with the prior art that extracts noise with Voice Activity Detection (VAD), the model-based SAD technique performs better. In the second aspect, the ASR technique is applied to the noise extraction field, making it possible to distinguish speech frames from noise frames according to the attributes of the phonemes. In the third aspect, a multi-model fusion mechanism is introduced into the noise extraction field: a bitwise OR operation is performed on the SAD model output and the ASR acoustic model decoding output, so that noise frames can be identified accurately.
Example IV
Fig. 4 is a schematic structural diagram of a noise extraction device according to a fourth embodiment of the present invention, where the device may be adapted to extract noise segments from speech in an actual environment, and the device may be implemented by software and/or hardware and is generally integrated on a computer device.
As shown in fig. 4, the apparatus includes:
An acoustic feature acquisition module 410, configured to acquire acoustic features of each speech frame in the speech data;
a first class label obtaining module 420, configured to input the acoustic feature into a first speech recognition model, and obtain a first class label of each speech frame;
a phoneme label obtaining module 430, configured to input the acoustic feature into a second speech recognition model, and obtain a phoneme label of each speech frame;
a second class label determining module 440, configured to determine a second class label of each speech frame according to the phoneme label;
The tag fusion module 450 is configured to fuse the first class tag and the second class tag to obtain a target tag of each voice frame;
The noise segment extracting module 460 is configured to determine a noise segment according to the target tag, and extract the noise segment.
In this embodiment, the device is configured to obtain, through an acoustic feature obtaining module, acoustic features of each speech frame in speech data; the first class label acquisition module is used for inputting the acoustic characteristics into a first voice recognition model to acquire first class labels of each voice frame; secondly, inputting the acoustic features into a second voice recognition model through a phoneme label acquisition module to acquire phoneme labels of all voice frames; then, a second class label determining module is used for determining a second class label of each voice frame according to the phoneme label; then, a label fusion module is used for fusing the first class label and the second class label to obtain a target label of each voice frame; and finally, determining a noise section according to the target tag by a noise section extraction module, and extracting the noise section.
The embodiment provides a noise extraction device, which can improve the accuracy of noise extraction by fusing the recognition results of two neural networks to obtain the noise in the voice data.
Further, the acoustic feature acquisition module 410 is specifically configured to: carrying out framing treatment on the voice data to obtain a plurality of voice frames; and extracting acoustic features of the plurality of speech frames.
Further, the second class label determining module 440 is specifically configured to construct a speech recognition decoding diagram HCLG according to the text data and the preset pronunciation dictionary; the text data is data corresponding to the voice data; decoding the phoneme label according to HCLG to obtain a phoneme category; and determining a second class label of each voice frame according to the phoneme class.
Further, a decoding module is configured to integrate the phoneme labels according to the HCLG to obtain phoneme information; the phoneme information includes a plurality of phoneme labels; determine the phoneme category of the phoneme information according to the class of each phoneme label in the pronunciation dictionary; match the phoneme information with each speech frame according to the HCLG; and determine the phoneme category of each speech frame according to the matching result. Further, the tag fusion module 450 is specifically configured to perform a bit or operation on the first class tag and the second class tag to obtain the target tag of each voice frame.
Further, the noise segment extraction module 460 is further configured to determine a speech segment formed by speech frames with the target tag being the first set value continuously as the noise segment.
Further, the first training module is used for acquiring acoustic characteristics of each voice frame and a first class label of each voice frame in the sample voice data; and training a first voice recognition model based on a first training data pair formed by the acoustic features and the first class labels.
Further, the second training module is used for parsing the sample text data to obtain the phoneme information of each speech frame; the sample text data is data corresponding to the sample speech data; determining the phoneme label corresponding to the phoneme information; and training the second speech recognition model based on a second training data pair formed by the acoustic features and the phoneme labels.
The noise extraction device can execute the noise extraction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. As shown in fig. 5, the computer device provided in the fifth embodiment of the present invention includes: one or more processors 51 and a storage device 52; the number of processors 51 in the computer device may be one or more, one processor 51 being taken as an example in fig. 5; the storage device 52 is used for storing one or more programs; the one or more programs are executed by the one or more processors 51, so that the one or more processors 51 implement the noise extraction method according to any embodiment of the present invention.
The computer device may further include: an input device 53 and an output device 54.
The processor 51, the storage means 52, the input means 53 and the output means 54 in the computer device may be connected by a bus or by other means, in fig. 5 by way of example.
The storage device 52 in the computer apparatus, as a computer-readable storage medium, may be used to store one or more programs, such as software programs, computer-executable programs and modules, e.g. the program instructions/modules corresponding to the noise extraction method provided in the first or second embodiment of the present invention (for example, the modules in the noise extraction device shown in fig. 4: the acoustic feature acquisition module 410, the first class label acquisition module 420, the phoneme label acquisition module 430, the second class label determining module 440, the label fusion module 450, and the noise segment extraction module 460). The processor 51 executes various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the storage device 52, i.e. implements the noise extraction method in the above-described method embodiments.
Storage device 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functionality; the storage data area may store data created according to the use of the computer device, etc. In addition, the storage 52 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, storage 52 may further include memory located remotely from processor 51, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 53 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output device 54 may include a display device such as a display screen.
And, when one or more programs included in the above-mentioned computer device are executed by the one or more processors 51, the programs perform the following operations:
acquiring acoustic characteristics of each voice frame in voice data;
inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame;
inputting the acoustic features into a second voice recognition model to obtain phoneme labels of each voice frame;
determining a second class label of each voice frame according to the phoneme label;
fusing the first class labels and the second class labels to obtain target labels of all voice frames;
And determining a noise section according to the target tag, and extracting the noise section.
Example six
A sixth embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program for executing a noise extraction method when executed by a processor, the method comprising:
acquiring acoustic characteristics of each voice frame in voice data;
inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame;
inputting the acoustic features into a second voice recognition model to obtain phoneme labels of each voice frame;
determining a second class label of each voice frame according to the phoneme label;
fusing the first class labels and the second class labels to obtain target labels of all voice frames;
And determining a noise section according to the target tag, and extracting the noise section.
Optionally, the program may be further configured to perform the noise extraction method provided by any embodiment of the present invention when executed by a processor.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access Memory (Random Access Memory, RAM), a Read-Only Memory (ROM), an erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), a flash Memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. A computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to: electromagnetic signals, optical signals, or any suitable combination of the preceding. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, radio Frequency (RF), and the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein; various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, while the invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the concept of the invention, the scope of which is determined by the appended claims.

Claims (9)

1. A noise extraction method, comprising:
acquiring acoustic features of each voice frame in voice data;
inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame;
inputting the acoustic features into a second voice recognition model to obtain phoneme labels of each voice frame;
determining a second class label of each voice frame according to the phoneme label;
fusing the first class labels and the second class labels to obtain target labels of all voice frames;
determining a noise segment according to the target label, and extracting the noise segment;
wherein determining the second class label of each voice frame according to the phoneme label comprises:
constructing a voice recognition decoding graph HCLG according to text data and a preset pronunciation dictionary, wherein the text data is data corresponding to the voice data;
decoding the phoneme labels according to the HCLG to obtain a phoneme category;
and determining the second class label of each voice frame according to the phoneme category;
wherein decoding the phoneme labels according to the HCLG to obtain a phoneme category comprises:
integrating the phoneme labels according to the HCLG to obtain phoneme information, wherein the phoneme information comprises a plurality of phoneme labels;
determining the phoneme category of the phoneme information according to the category of each phoneme label in the preset pronunciation dictionary;
matching the phoneme information with each voice frame according to the HCLG;
and determining the phoneme category of each voice frame according to the matching result.
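Claim 1 decodes per-frame phoneme labels against an HCLG graph to obtain phoneme categories; building and searching HCLG normally requires a WFST toolkit, so the sketch below replaces it with a deliberately simplified stand-in: consecutive identical phoneme labels are integrated into runs, each run's category is looked up in a pronunciation dictionary, and the category is mapped back onto the frames as a second class label. The dictionary contents, the non-speech symbols, and the 0/1 convention are assumptions for illustration only, not the claimed decoding procedure.

```python
from itertools import groupby

# Hypothetical pronunciation-dictionary categories: phoneme label -> category.
PRONUNCIATION_DICT = {"a": "speech", "b": "speech", "SIL": "nonspeech", "NSN": "nonspeech"}

def second_class_labels(frame_phoneme_labels):
    """Integrate per-frame phoneme labels into runs, look up each run's category,
    and map the category back onto the frames as a second class label (0=speech, 1=noise)."""
    labels = []
    for phoneme, frames in groupby(frame_phoneme_labels):
        run_length = sum(1 for _ in frames)                    # frames covered by this run
        category = PRONUNCIATION_DICT.get(phoneme, "nonspeech")
        labels.extend([0 if category == "speech" else 1] * run_length)
    return labels
```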
2. The method of claim 1, wherein acquiring acoustic features of each voice frame in the voice data comprises:
performing framing on the voice data to obtain a plurality of voice frames;
and extracting acoustic features of the plurality of voice frames.
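Claim 2 leaves the framing parameters and the feature type open; a common choice is 25 ms frames with a 10 ms shift and MFCC features. A minimal sketch under those assumptions, using librosa (the library choice and the 16 kHz sample rate are also assumptions, not part of the claim):

```python
import librosa

def frame_level_features(wav_path, frame_length=0.025, frame_shift=0.010, n_mfcc=13):
    """Frame the waveform and extract one acoustic feature vector per voice frame."""
    waveform, sample_rate = librosa.load(wav_path, sr=16000)
    features = librosa.feature.mfcc(
        y=waveform,
        sr=sample_rate,
        n_mfcc=n_mfcc,
        n_fft=int(frame_length * sample_rate),      # window length in samples
        hop_length=int(frame_shift * sample_rate),  # frame shift in samples
    )
    return features.T  # shape: (num_frames, n_mfcc), one row per voice frame
```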
3. The method of claim 1, wherein fusing the first class labels and the second class labels to obtain target labels of all voice frames comprises:
and performing a bitwise OR operation on the first class label and the second class label to obtain the target label of each voice frame.
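The bitwise OR in claim 3 makes the fusion conservative: a frame is treated as speech only when both labels mark it as speech. A short illustration with assumed 0/1 label arrays (1 = noise):

```python
import numpy as np

first_class_labels = np.array([0, 0, 1, 1, 0])    # from the first voice recognition model
second_class_labels = np.array([0, 1, 1, 0, 0])   # derived from the phoneme labels
target_labels = first_class_labels | second_class_labels
# target_labels -> [0, 1, 1, 1, 0]: a frame counts as noise if either label says noise
```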
4. The method of claim 1, wherein determining a noise segment from the target label comprises:
and determining, as a noise segment, a segment formed by consecutive voice frames whose target labels equal a first set value.
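The rule in claim 4 yields noise segments as frame index ranges; actually extracting the audio requires mapping those ranges back to samples. A minimal sketch, assuming the 25 ms / 10 ms framing used above (those values and the function name are illustrative, not mandated by the claim):

```python
def cut_noise_audio(waveform, sample_rate, noise_segments, frame_shift=0.010, frame_length=0.025):
    """Map (start_frame, end_frame) noise segments to sample ranges and slice them out."""
    hop = int(frame_shift * sample_rate)
    win = int(frame_length * sample_rate)
    clips = []
    for start_frame, end_frame in noise_segments:
        start_sample = start_frame * hop
        end_sample = (end_frame - 1) * hop + win   # cover the last frame's full window
        clips.append(waveform[start_sample:end_sample])
    return clips
```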
5. The method of any one of claims 1-4, wherein the training process of the first voice recognition model comprises:
acquiring acoustic features of each voice frame in sample voice data and a first class label of each voice frame;
and training the first voice recognition model based on first training data pairs formed by the acoustic features and the first class labels.
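Claim 5 does not fix a model architecture for the first voice recognition model; any frame-level classifier trained on (acoustic feature, first class label) pairs fits the description. A minimal sketch using a small scikit-learn MLP, where the architecture and hyperparameters are assumptions for illustration:

```python
from sklearn.neural_network import MLPClassifier

def train_first_model(features, first_class_labels):
    """Fit a frame-level classifier on (acoustic feature, first class label) pairs.

    features: (num_frames, feat_dim) acoustic features from sample voice data.
    first_class_labels: (num_frames,) labels, assumed 0 = speech, 1 = noise.
    """
    model = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200)
    model.fit(features, first_class_labels)
    return model  # model.predict(features) yields a first class label per frame
```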
6. The method of claim 5, wherein the training process of the second voice recognition model comprises:
analyzing sample text data to obtain phoneme information of each voice frame, wherein the sample text data is data corresponding to the sample voice data;
determining a phoneme label corresponding to the phoneme information;
and training the second voice recognition model based on second training data pairs formed by the acoustic features and the phoneme labels.
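The distinctive step in claim 6 is deriving phoneme labels from the sample text data. A minimal sketch of that text analysis, assuming a toy pronunciation lexicon and integer label numbering (both illustrative); distributing the labels over individual voice frames would additionally need an alignment step, which is omitted here:

```python
# Hypothetical lexicon: word -> phoneme sequence.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
# Assign each distinct phoneme an integer phoneme label.
PHONEME_IDS = {p: i for i, p in enumerate(sorted({ph for prons in LEXICON.values() for ph in prons}))}

def phoneme_labels_for_text(text):
    """Analyze sample text data: words -> phonemes -> integer phoneme labels."""
    labels = []
    for word in text.lower().split():
        for phoneme in LEXICON.get(word, []):
            labels.append(PHONEME_IDS[phoneme])
    return labels

# e.g. phoneme_labels_for_text("hello world") -> one label per phoneme in the utterance
```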
7. A noise extraction device, characterized by comprising:
the acoustic feature acquisition module is used for acquiring acoustic features of each voice frame in the voice data;
the first class label acquisition module is used for inputting the acoustic features into a first voice recognition model to obtain a first class label of each voice frame;
the phoneme label acquisition module is used for inputting the acoustic features into a second voice recognition model to obtain a phoneme label of each voice frame;
the second class label determining module is used for determining a second class label of each voice frame according to the phoneme label;
the label fusion module is used for fusing the first class labels and the second class labels to obtain target labels of all voice frames;
the noise segment extraction module is used for determining a noise segment according to the target label and extracting the noise segment;
the second class label determining module is specifically configured to: construct a voice recognition decoding graph HCLG according to text data and a preset pronunciation dictionary, wherein the text data is data corresponding to the voice data; decode the phoneme labels according to the HCLG to obtain a phoneme category; and determine the second class label of each voice frame according to the phoneme category;
the decoding module is used for: integrating the phoneme labels according to the HCLG to obtain phoneme information, wherein the phoneme information comprises a plurality of phoneme labels; determining the phoneme category of the phoneme information according to the category of each phoneme label in the preset pronunciation dictionary; matching the phoneme information with each voice frame according to the HCLG; and determining the phoneme category of each voice frame according to the matching result.
8. A computer device, comprising:
One or more processors;
A storage means for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the noise extraction method of any of claims 1-6.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the noise extraction method of any one of claims 1-6.
CN202011131906.9A 2020-10-21 2020-10-21 Noise extraction method, device, equipment and storage medium Active CN112420022B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011131906.9A CN112420022B (en) 2020-10-21 2020-10-21 Noise extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011131906.9A CN112420022B (en) 2020-10-21 2020-10-21 Noise extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112420022A CN112420022A (en) 2021-02-26
CN112420022B true CN112420022B (en) 2024-05-10

Family

ID=74841606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011131906.9A Active CN112420022B (en) 2020-10-21 2020-10-21 Noise extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112420022B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118173093B (en) * 2024-05-09 2024-07-02 辽宁御云科技有限公司 Speech dialogue method and system based on artificial intelligence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101548313A (en) * 2006-11-16 2009-09-30 国际商业机器公司 Voice activity detection system and method
JP2011191542A (en) * 2010-03-15 2011-09-29 Nec Corp Voice classification device, voice classification method, and program for voice classification
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device
CN109036471A (en) * 2018-08-20 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method and equipment
CN110738986A (en) * 2019-10-24 2020-01-31 数据堂(北京)智能科技有限公司 long voice labeling device and method
CN111243599A (en) * 2020-01-13 2020-06-05 网易有道信息技术(北京)有限公司 Speech recognition model construction method, device, medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354660B2 (en) * 2017-04-28 2019-07-16 Cisco Technology, Inc. Audio frame labeling to achieve unequal error protection for audio frames of unequal importance

Also Published As

Publication number Publication date
CN112420022A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN110706690B (en) Speech recognition method and device thereof
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN108428446B (en) Speech recognition method and device
EP2387031B1 (en) Methods and systems for grammar fitness evaluation as speech recognition error predictor
CN111429946A (en) Voice emotion recognition method, device, medium and electronic equipment
CN112750446B (en) Voice conversion method, device and system and storage medium
CN113707125B (en) Training method and device for multi-language speech synthesis model
Khelifa et al. Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system
WO2022057283A1 (en) Voiceprint registration method and apparatus, and computer readable storage medium
CN105788596A (en) Speech recognition television control method and system
CN112349289A (en) Voice recognition method, device, equipment and storage medium
CN114330371A (en) Session intention identification method and device based on prompt learning and electronic equipment
CN112580340A (en) Word-by-word lyric generating method and device, storage medium and electronic equipment
CN110853669B (en) Audio identification method, device and equipment
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN112420022B (en) Noise extraction method, device, equipment and storage medium
CN113903326A (en) Speech synthesis method, apparatus, device and storage medium
Kadyan et al. Prosody features based low resource Punjabi children ASR and T-NT classifier using data augmentation
CN111916062A (en) Voice recognition method, device and system
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN114512121A (en) Speech synthesis method, model training method and device
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium
Kuzdeuov et al. Speech command recognition: Text-to-speech and speech corpus scraping are all you need

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant