CN114937459A - Hierarchical fusion audio data enhancement method and system - Google Patents
- Publication number: CN114937459A
- Application number: CN202210458199.7A
- Authority: CN (China)
- Prior art keywords: audio, frequency, fundamental frequency
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0216, G10L21/0224: Noise filtering characterised by the method used for estimating noise; processing in the time domain
- G10L21/0232: Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
- G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters

(All within G: Physics; G10: Musical instruments, acoustics; G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Abstract
The invention provides a hierarchical fusion audio data enhancement method and system, the method comprising: collecting an original audio signal X; performing time-domain signal companding on X using the WSOLA algorithm to obtain the companded audio X_o; mixing the processed audio X_o with the original audio X to form a new training set S_x; extracting the frequency of each audio in the training set; performing fundamental frequency estimation with the SWIPE algorithm and constructing the fundamental frequency set S_f; normalizing the frequencies using the perturbed fundamental frequencies to construct the frequency set S_F; and extracting acoustic features using a fast Fourier transform. The proposed audio data enhancement method improves the noise robustness of the model and can be adapted to a variety of speech tasks, including but not limited to audio classification, voiceprint recognition and speech recognition.
Description
Technical Field
The invention relates to an audio data enhancement method applicable to various audio tasks, and belongs to the field of audio data processing.
Background
Currently, most audio tasks rely on large amounts of labeled data: the larger the dataset, the better the model. For audio tasks under low-resource conditions, data enhancement is a simple and effective way to construct new samples. With data enhancement, a model can extract stable speech representations even from small samples, and the recognition performance improves greatly compared with the original training method.
Existing research focuses on data enhancement at both the front-end and the feature level to improve recognition performance in unknown environments. Adding parameterized reverberation, offset and speed perturbations to the original audio simulates noise in real environments, and enhancing data this way greatly improves model robustness. Augmenting data with vocal tract length perturbation has also proven effective. Furthermore, a data enhancement method based on signal compression, which companded the original signal using the A-law and μ-law schemes, has been successfully applied in the field of sound attack detection.
Beyond front-end enhancement, data enhancement can also be performed on audio features. Existing research realizes feature-level enhancement based on fundamental frequency normalization, constructing multiple similar frequencies by adding perturbations of different degrees to the fundamental frequency. This method achieves good results on speech recognition tasks.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: for the audio classification task, on the one hand, acquiring large amounts of external noise data for data enhancement is costly; on the other hand, the samples constructed by existing methods are relatively limited.
In order to solve the above technical problem, one aspect of the present invention provides a hierarchical fusion audio data enhancement method, comprising the following steps:
a) collecting an original signal and storing it as audio X in the form of a digital signal;
b) performing time-domain signal companding on the audio X to obtain the companded audio X_o: a waveform-similarity overlap-add (WSOLA) algorithm is adopted, which preserves the harmonic signal while introducing a series of distortions, yielding the companded audio X_o;
c) mixing the companded audio X_o with the original audio X to form a new training set S_x;
d) extracting the frequency of each audio in the training set S_x to obtain the frequency f;
e) extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
f) adding perturbations to the fundamental frequency f_o,def to form the fundamental frequency set S_f: frequency offsets of ±20, ±40 and ±60 are added respectively, and the perturbed fundamental frequencies f_o, together with the original fundamental frequency f_o,def, form the fundamental frequency set S_f;
g) normalizing the frequency using the fundamental frequency set S_f to construct the frequency set S_F: for the spectrogram of the current audio, the fundamental frequency of each frame is extracted as in step e), and the median of these fundamental frequencies is recorded as f_o,audio; each value in S_f is then used to normalize the frequency values on the mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig denotes a frequency value on the mel scale, f_o,audio the median fundamental frequency over all frames of the current audio, and f_o,def the default fundamental frequency; the normalized values f_norm form the frequency set S_F;
h) extracting acoustic features using the frequency set S_F: taking the elements of S_F as references, the signal is converted by fast Fourier transform into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds.
Preferably, with any one audio frame in the original audio X defined as the first audio frame, step b) specifically comprises the following steps:
selecting a second audio frame in the neighborhood of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
searching for a third audio frame within the range [-Δmax, Δmax], with Δmax set to half an audio period; the cross-correlation coefficients between frames in this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
splicing the first, second and third audio frames with the same step size and adding the overlapping parts.
Preferably, in step c), the companded audio X_o is assigned the same label as the original audio X and added to the original dataset, together forming the new training set S_x.
Preferably, in step d), the audio in the training set S_x is subjected to framing, windowing and mel-scale transformation to extract audio frequency features, thereby obtaining the frequency f.
Preferably, in step e), fundamental frequency estimation is performed using the SWIPE algorithm. The significance of the peak of the amplitude spectrum at each integer multiple of a candidate frequency f, relative to its two immediately adjacent valleys, is measured by the peak-to-valley distance:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|)/2
where d_k(f) denotes the peak-to-valley distance, k = 1, 2, …, n denotes the harmonic multiple, and |X(kf)| denotes the peak of the amplitude spectrum at k times the frequency f;
the saliency is expressed as the average peak-to-valley distance over the harmonics:
S(f) = (1/n) Σ_{k=1..n} d_k(f)
The fundamental frequency finally estimated from this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for subsequent feature normalization.
Another technical solution of the present invention provides a hierarchical fusion audio data enhancement system, characterized by comprising:
a signal companding unit: used for performing time-domain signal companding on the audio X to obtain the companded audio X_o; a waveform-similarity overlap-add algorithm is adopted, which preserves the harmonic signal while introducing a series of distortions, yielding the companded audio X_o;
a training set construction unit: used for mixing the companded audio X_o with the original audio X to form a new training set S_x;
a frequency extraction unit: used for extracting the frequency of each audio in the training set S_x to obtain the frequency f;
a fundamental frequency extraction unit: used for extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
a fundamental frequency perturbation unit: used for adding perturbations to the fundamental frequency f_o,def to form the fundamental frequency set S_f; frequency offsets of ±20, ±40 and ±60 are added respectively, and the perturbed fundamental frequencies f_o, together with the original fundamental frequency f_o,def, form the set S_f;
a frequency normalization unit: used for normalizing the frequency with the fundamental frequency set S_f to construct the frequency set S_F; for the spectrogram of the current audio, the fundamental frequency of each frame is extracted, the median of these fundamental frequencies is recorded as f_o,audio, and each value in S_f is used to normalize the frequency values on the mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig denotes a frequency value on the mel scale, f_o,audio the median fundamental frequency over all frames of the current audio, and f_o,def the default fundamental frequency; the normalized values f_norm form the frequency set S_F;
an acoustic feature extraction unit: used for extracting acoustic features with the frequency set S_F; the elements of S_F serve as references, and the signal is converted by fast Fourier transform into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds.
The invention provides a method for enhancing data by hierarchical fusion. Time-scale modification (TSM) is a technique that changes the speed of audio without changing its pitch and is very important in current audio signal processing. A TSM algorithm divides a segment of an audio signal evenly into frames, applies processing such as stretching or compressing to each frame, and finally re-superimposes the frames into a composite signal. The present method realizes companding of the audio signal based on the waveform-similarity overlap-add algorithm and applies the companded signal to the construction of new audio, achieving data enhancement without using any additional audio. These audio signals are then used for feature extraction, during which feature-level data enhancement is performed: adding small perturbations to the sound frequency improves the robustness of the model against noise, and the fundamental-frequency-normalization-based method constructs new samples similar to the original audio while reducing the influence of audio differences caused by environmental noise on the recognition result.
The invention provides a hierarchical fusion audio data enhancement method that fuses existing audio data enhancement methods, divides the enhancement into different stages, can be used for various audio tasks, and therefore has a certain universality. The invention realizes audio data enhancement without using extra noise data, reducing the dependence of existing data enhancement methods on external noise data.
Drawings
FIG. 1 is a flow chart of a method of audio data enhancement of the present invention;
fig. 2 is a flow chart of the computation of audio feature data enhancement of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It is to be understood that these examples are included merely to promote an understanding of the invention and are not intended to limit its scope or application. Various alterations and modifications will occur to those skilled in the art upon reading this disclosure, and such equivalent variations likewise fall within the scope of the appended claims. This description covers preferred embodiments by way of example and need not be exhaustive.
As shown in fig. 1, the audio data enhancement method of the present invention comprises two stages: audio signal data enhancement and audio feature data enhancement. The former comprises signal companding and training set construction; the latter comprises frequency extraction, fundamental frequency perturbation, frequency normalization and acoustic feature extraction. The specific implementation process is as follows:
the first step is as follows: the raw signal is collected by a sensor and stored as audio X in the form of a digital signal.
The second step is that: performing time domain signal companding on the audio X to obtain the companded audio X o
A waveform-similarity overlap-add algorithm is adopted, which preserves the harmonic signal while introducing a series of distortions, yielding the companded audio X_o. With any audio frame in the original audio defined as the first audio frame, the specific operations of the second step are:
(1) selecting a second audio frame in the neighborhood of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
(2) searching for a third audio frame within the range [-Δmax, Δmax], with Δmax set to half an audio period; the cross-correlation coefficients between frames in this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
(3) splicing the first, second and third audio frames with the same step size and adding the overlapping parts.
The third step: the companded audio X_o is mixed with the original audio X to form a new training set S_x.
The companded audio X_o is assigned the same label as the original audio X and added to the original dataset, together forming the new training set S_x.
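As a sketch, this training-set construction reduces to label-preserving pooling. The data structures used here, lists of (signal, label) pairs, are an assumed representation for illustration, not the patent's.

```python
def build_training_set(originals, companded):
    # S_x: pool original and companded signals; each companded signal
    # inherits the label of the original it was derived from.
    # originals: list of (signal, label) pairs.
    # companded: list of companded signals, index-aligned with originals.
    s_x = list(originals)
    for (sig, label), comp in zip(originals, companded):
        s_x.append((comp, label))  # same label as the source audio
    return s_x
```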
The fourth step: the frequency of each audio in the training set S_x is extracted to obtain the frequency f.
The audio in the training set S_x is subjected to operations such as framing, windowing and mel-scale transformation to extract audio frequency features; the basic flow is shown in fig. 2, and the specific operations are as follows:
(1) Audio framing
Because the audio signal is non-stationary as a whole, frequency-domain conversion cannot be applied directly to the entire utterance; framing divides a segment of audio into many short segments, each of which can be regarded as stationary. After framing, each short segment can be treated as a stable signal. In this embodiment, 25 ms is used as the frame length; to avoid abrupt signal changes between adjacent frames, a new frame is taken every 10 ms, i.e., the frame shift is 10 ms.
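A minimal framing sketch with the embodiment's 25 ms frame length and 10 ms shift; the 16 kHz sampling rate is an assumption, since the patent does not state one.

```python
def frame_signal(x, sr=16000, frame_ms=25, hop_ms=10):
    # 25 ms frames taken every 10 ms -> adjacent frames overlap by 15 ms.
    flen = sr * frame_ms // 1000   # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000      # 160 samples at 16 kHz
    return [x[i:i + flen] for i in range(0, len(x) - flen + 1, hop)]
```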
(2) Windowing
A window function slides over the sliced frames so that the amplitude of each frame signal tapers gradually to zero at both ends, thereby emphasizing the middle portion of the frame; this improves the resolution of the subsequent Fourier transform. Because of the window function, the amplitudes at the two ends of each frame are attenuated, but since framing leaves adjacent frames overlapping, every sampling point still receives significant weight from some window.
In this embodiment, a Hamming window is used as the window function w(n):
w(n) = 0.54 - 0.46 cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where n denotes the position of the (n + 1)-th sampling point within a frame, and N denotes the Hamming window length, i.e., the total number of sampling points, 256 in this embodiment.
(3) Mel scale conversion
For the frequencies contained in the current frame, the mel-scale transformation is applied:
f = 2595 · log10(1 + f′/700)
where f is the transformed (mel) frequency and f′ is the frequency before transformation.
The fifth step: fundamental frequency extraction is performed on each audio.
This embodiment uses the SWIPE algorithm for fundamental frequency estimation. The significance of the peak of the amplitude spectrum at each integer multiple of a candidate frequency f, relative to its two immediately adjacent valleys, is measured by the peak-to-valley distance:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|)/2
where d_k(f) denotes the peak-to-valley distance, k = 1, 2, …, n denotes the harmonic multiple, and |X(kf)| denotes the peak of the amplitude spectrum at k times the frequency f.
The saliency is expressed as the average peak-to-valley distance over the harmonics:
S(f) = (1/n) Σ_{k=1..n} d_k(f)
The fundamental frequency finally estimated from this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for subsequent feature normalization.
The sixth step: perturbations are added to the fundamental frequency to form the fundamental frequency set S_f.
Perturbations are added to the fundamental frequency f_o,def by applying frequency offsets of ±20, ±40 and ±60 (offsets on the mel scale), giving the perturbed fundamental frequencies f_o; together with the original fundamental frequency f_o,def, these form the fundamental frequency set S_f.
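The set construction is straightforward; a sketch (the helper name is assumed):

```python
def build_f0_set(f0_def, offsets=(20.0, 40.0, 60.0)):
    # S_f = {f0_def, f0_def ± 20, f0_def ± 40, f0_def ± 60},
    # the offsets being mel-scale shifts per the embodiment.
    s_f = [f0_def]
    for d in offsets:
        s_f.extend([f0_def + d, f0_def - d])
    return s_f
```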
The seventh step: the frequencies are normalized using the fundamental frequency set to construct the frequency set S_F.
The fundamental frequency set S_f is used to normalize the frequency features. For the spectrogram of the current audio, the fundamental frequency of each frame is extracted by the method of the fifth step, and the median of these fundamental frequencies is recorded as f_o,audio; each value in the set S_f is then used to normalize the frequency values on the mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig denotes a frequency value on the mel scale, f_o,audio the median fundamental frequency over all frames of the current audio, and f_o,def the default fundamental frequency.
The normalized values f_norm form the frequency set S_F.
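The normalization step can be sketched as follows (helper names assumed): for each target fundamental drawn from S_f, every mel-scale frequency value of the current audio is shifted by the same constant, so each element of S_f yields one normalized variant of the features.

```python
def median(vals):
    # Median of the per-frame fundamental frequencies (f_o,audio).
    s = sorted(vals)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def normalize_frequencies(mel_freqs, frame_f0s, f0_target):
    # f_norm = f_orig - (f_o,audio - f_o,def): shift the mel-scale
    # frequencies so the audio's median fundamental maps onto the
    # (possibly perturbed) target fundamental drawn from S_f.
    f0_audio = median(frame_f0s)
    return [f - (f0_audio - f0_target) for f in mel_freqs]
```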
The eighth step: acoustic feature extraction is performed using the frequency set S_F.
Taking the elements of the frequency set as reference frequencies, a fast Fourier transform (FFT) converts the signal into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds.
The spectrum is computed by applying an N-point FFT to each frame signal, with N = 256. The short-time Fourier transform S(t, f) of audio X is expressed as:
S(t, f) = Σ_n x(n) · w(n - t) · e^(-j2πfn/N)
where x(n) denotes the audio signal at the n-th sampling position in the t-th frame and w(n - t) denotes the window function.
Then, according to the actual task, various acoustic features such as Fbank or MFCC are extracted and used as the input of the corresponding model.
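A sketch of the per-frame energy distribution via a direct N-point DFT; a real implementation would use an FFT routine, and the function name is assumed.

```python
import cmath, math

def frame_energy_spectrum(frame, N=256):
    # Squared magnitude of the N-point DFT of one (windowed) frame:
    # the energy distribution over frequency bins.
    frame = (list(frame) + [0.0] * N)[:N]  # zero-pad / truncate to N
    spec = []
    for k in range(N):
        s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
        spec.append(abs(s) ** 2)
    return spec
```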
The method provided by the invention needs no extra noise data: it operates only on the input audio signal and requires no label transformation. Compared with existing task-specific data enhancement methods it has wider applicability, and it can be used in common tasks such as audio classification, voiceprint recognition and speech recognition.
Claims (6)
1. A hierarchical fusion audio data enhancement method, characterized by comprising the following steps:
a) collecting an original signal and storing it as audio X in the form of a digital signal;
b) performing time-domain signal companding on the audio X to obtain the companded audio X_o: a waveform-similarity overlap-add algorithm is adopted, which preserves the harmonic signal while introducing a series of distortions, yielding the companded audio X_o;
c) mixing the companded audio X_o with the original audio X to form a new training set S_x;
d) extracting the frequency of each audio in the training set S_x to obtain the frequency f;
e) extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
f) adding perturbations to the fundamental frequency f_o,def to form the fundamental frequency set S_f: frequency offsets of ±20, ±40 and ±60 are added respectively, and the perturbed fundamental frequencies f_o, together with the original fundamental frequency f_o,def, form the fundamental frequency set S_f;
g) normalizing the frequency using the fundamental frequency set S_f to construct the frequency set S_F: for the spectrogram of the current audio, the fundamental frequency of each frame is extracted as in step e), and the median of these fundamental frequencies is recorded as f_o,audio; each value in S_f is then used to normalize the frequency values on the mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig denotes a frequency value on the mel scale, f_o,audio the median fundamental frequency over all frames of the current audio, and f_o,def the default fundamental frequency; the normalized values f_norm form the frequency set S_F;
h) extracting acoustic features using the frequency set S_F: taking the elements of S_F as references, the signal is converted by fast Fourier transform into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds.
2. The hierarchical fusion audio data enhancement method of claim 1, characterized in that, with any audio frame in the original audio X defined as the first audio frame, step b) specifically comprises the following steps:
selecting a second audio frame in the neighborhood of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
searching for a third audio frame within the range [-Δmax, Δmax], with Δmax set to half an audio period; the cross-correlation coefficients between frames in this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
splicing the first, second and third audio frames with the same step size and adding the overlapping parts.
3. The hierarchical fusion audio data enhancement method of claim 1, characterized in that, in step c), the companded audio X_o is assigned the same label as the original audio X and added to the original dataset, together forming the new training set S_x.
4. The hierarchical fusion audio data enhancement method of claim 1, characterized in that, in step d), the audio in the training set S_x is subjected to framing, windowing and mel-scale transformation to extract audio frequency features, thereby obtaining the frequency f.
5. The hierarchical fusion audio data enhancement method of claim 1, characterized in that, in step e), fundamental frequency estimation is performed using the SWIPE algorithm, and the significance of the peak of the amplitude spectrum at each integer multiple of a candidate frequency f, relative to its two immediately adjacent valleys, is measured by the peak-to-valley distance:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|)/2
where d_k(f) denotes the peak-to-valley distance, k = 1, 2, …, n denotes the harmonic multiple, and |X(kf)| denotes the peak of the amplitude spectrum at k times the frequency f;
the saliency is expressed as the average peak-to-valley distance over the harmonics:
S(f) = (1/n) Σ_{k=1..n} d_k(f)
The fundamental frequency finally estimated from this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for subsequent feature normalization.
6. A hierarchical fused audio data enhancement system, comprising:
a signal companding unit: is used for carrying out time domain signal companding on the audio X to obtain the audio X after companding o Wherein, when the audio X is subjected to time domain signal companding, the waveform similarity overlapping superposition calculation is adoptedMethod for obtaining companded audio X by introducing a series of distortions while preserving harmonic signals o ;
A training set construction unit: for compressing the expanded audio X o Mixed with the original audio X to form a new training set S x ;
A frequency extraction unit: for extracting training set S x The frequency of each audio in the audio signal is obtained to obtain a frequency f;
a fundamental frequency extraction unit: for extracting fundamental frequency of each audio to obtain fundamental frequency f o,def ;
A fundamental frequency disturbance adding unit: for the fundamental frequency f o,def Adding disturbance to form a fundamental frequency set S f Wherein for the fundamental frequency f o,def When adding disturbance, respectively adding frequency offsets of +/-20, +/-40 and +/-60 to obtain disturbed fundamental frequency f o With the original fundamental frequency f o,def Jointly form a fundamental frequency set S f ;
a frequency normalization unit: used for normalizing the frequency with the fundamental frequency set S_f to construct a frequency set S_F, wherein, when the frequency feature is normalized with the fundamental frequency set S_f, for the spectrogram corresponding to the current audio, the fundamental frequency of each frame is extracted as in step d), and the median of these fundamental frequencies is recorded as f_o,audio; each value in the fundamental frequency set S_f is then used in turn to normalize the frequency value on the mel scale:

f_norm = f_orig - (f_o,audio - f_o,def)

where f_orig represents the frequency value on the mel scale, f_o,audio represents the median of the fundamental frequencies of all frames in the current audio, and f_o,def represents the default fundamental frequency; the f_norm values obtained by normalization form the frequency set S_F;
an acoustic feature extraction unit: used for extracting acoustic features with the frequency set S_F, wherein, when acoustic features are extracted with the frequency set S_F, each element of the frequency set S_F is taken as a reference to perform a fast Fourier transform on the signal, converting it into an energy distribution in the frequency domain, where different energy distributions represent different properties of the speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210458199.7A CN114937459A (en) | 2022-04-28 | 2022-04-28 | Hierarchical fusion audio data enhancement method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114937459A true CN114937459A (en) | 2022-08-23 |
Family
ID=82863224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210458199.7A Pending CN114937459A (en) | 2022-04-28 | 2022-04-28 | Hierarchical fusion audio data enhancement method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114937459A (en) |
2022-04-28: application CN202210458199.7A filed in China (CN); publication CN114937459A; status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109147796B (en) | Speech recognition method, device, computer equipment and computer readable storage medium | |
CN111816218A (en) | Voice endpoint detection method, device, equipment and storage medium | |
CN103559882B (en) | A kind of meeting presider's voice extraction method based on speaker's segmentation | |
CN109192200B (en) | Speech recognition method | |
CN101226743A (en) | Method for recognizing speaker based on conversion of neutral and affection sound-groove model | |
US8566084B2 (en) | Speech processing based on time series of maximum values of cross-power spectrum phase between two consecutive speech frames | |
CN108305639B (en) | Speech emotion recognition method, computer-readable storage medium and terminal | |
CN103117059A (en) | Voice signal characteristics extracting method based on tensor decomposition | |
CN101221762A (en) | MP3 compression field audio partitioning method | |
CN108682432B (en) | Speech emotion recognition device | |
CN110647656B (en) | Audio retrieval method utilizing transform domain sparsification and compression dimension reduction | |
CN112071308A (en) | Awakening word training method based on speech synthesis data enhancement | |
CN110570870A (en) | Text-independent voiceprint recognition method, device and equipment | |
CN107103913B (en) | Speech recognition method based on power spectrum Gabor characteristic sequence recursion model | |
CN112786054A (en) | Intelligent interview evaluation method, device and equipment based on voice and storage medium | |
CN116884431A (en) | CFCC (computational fluid dynamics) feature-based robust audio copy-paste tamper detection method and device | |
Akdeniz et al. | Linear prediction coefficients based copy-move forgery detection in audio signal | |
CN114937459A (en) | Hierarchical fusion audio data enhancement method and system | |
CN111402898B (en) | Audio signal processing method, device, equipment and storage medium | |
Patil et al. | Content-based audio classification and retrieval: A novel approach | |
Aurchana et al. | Musical instruments sound classification using GMM | |
CN112309404A (en) | Machine voice identification method, device, equipment and storage medium | |
Yue et al. | Speaker age recognition based on isolated words by using SVM | |
Therese et al. | A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system | |
CN110634473A (en) | Voice digital recognition method based on MFCC |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||