CN114937459A - Hierarchical fusion audio data enhancement method and system - Google Patents
- Publication number: CN114937459A
- Application number: CN202210458199.7A
- Authority: CN (China)
- Prior art keywords: audio, frequency, fundamental frequency
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0216, G10L21/0224: Noise filtering characterised by the method used for estimating noise; processing in the time domain
- G10L21/0232: Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
- G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters

(All within G: Physics; G10: Musical instruments, acoustics; G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Abstract
The invention provides a hierarchical fusion audio data enhancement method and system, the method comprising: collecting an original audio signal X; performing time-domain signal companding on X using the WSOLA algorithm to obtain the companded audio X_o; mixing the processed audio X_o with the original audio X to form a new training set S_x; extracting the frequency of each audio in the training set; performing fundamental frequency estimation with the SWIPE algorithm and constructing the fundamental frequency set S_f; normalizing the frequencies using the perturbed fundamental frequencies to construct the frequency set S_F; and extracting acoustic features using a fast Fourier transform. The proposed audio data enhancement method improves the noise robustness of the model and can be adapted to a variety of speech tasks, including but not limited to audio classification, voiceprint recognition and speech recognition.
Description
Technical Field
The invention relates to an audio data enhancement method applicable to various audio tasks, and belongs to the field of audio data processing.
Background
Currently, most audio tasks rely on large amounts of labeled data: the larger the dataset, the better the model. For audio tasks under low-resource conditions, data enhancement is a simple and effective way to construct new samples. With data enhancement, a model can extract stable speech representations even from small samples, and the recognition performance improves greatly compared with the original training method.
Existing research focuses on data enhancement at both the front-end and the feature level to improve recognition performance in unknown environments. Adding parameterized reverberation, offset and speed perturbations to the original audio simulates noise in real environments, and enhancing data this way greatly improves model robustness. Augmenting data with vocal tract length perturbation has also proven effective. Furthermore, a data enhancement method based on signal compression, which companded the original signal using the A-law and μ-law schemes, has been successfully applied in the field of sound attack detection.
Beyond front-end enhancement, data enhancement can also be performed on audio features. Existing research realizes feature-level enhancement based on fundamental frequency normalization, constructing multiple similar frequencies by adding perturbations of different degrees to the fundamental frequency. This method achieves good results on speech recognition tasks.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: for the audio classification task, on the one hand, acquiring large amounts of external noise data for data enhancement is costly; on the other hand, the samples constructed by existing methods are relatively limited.
In order to solve the above technical problem, one aspect of the present invention provides a hierarchical fusion audio data enhancement method, comprising the following steps:
a) collecting an original signal and storing it as audio X in the form of a digital signal;
b) performing time-domain signal companding on the audio X to obtain the companded audio X_o: a waveform-similarity overlap-add (WSOLA) algorithm is adopted, which preserves the harmonic signal while introducing a series of distortions, yielding the companded audio X_o;
c) mixing the companded audio X_o with the original audio X to form a new training set S_x;
d) extracting the frequency of each audio in the training set S_x to obtain the frequency f;
e) extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
f) adding perturbations to the fundamental frequency f_o,def to form the fundamental frequency set S_f: frequency offsets of ±20, ±40 and ±60 are added respectively, and the perturbed fundamental frequencies f_o, together with the original fundamental frequency f_o,def, form the fundamental frequency set S_f;
g) normalizing the frequency using the fundamental frequency set S_f to construct the frequency set S_F: for the spectrogram of the current audio, the fundamental frequency of each frame is extracted as in step e), and the median of these fundamental frequencies is recorded as f_o,audio; each value in S_f is then used to normalize the frequency values on the mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig denotes a frequency value on the mel scale, f_o,audio the median fundamental frequency over all frames of the current audio, and f_o,def the default fundamental frequency; the normalized values f_norm form the frequency set S_F;
h) extracting acoustic features using the frequency set S_F: taking the elements of S_F as references, the signal is converted by fast Fourier transform into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds.
Preferably, with any one audio frame in the original audio X defined as the first audio frame, step b) specifically comprises the following steps:
selecting a second audio frame in the neighborhood of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
searching for a third audio frame within the range [-Δmax, Δmax], with Δmax set to half an audio period; the cross-correlation coefficients between frames in this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
splicing the first, second and third audio frames with the same step size and adding the overlapping parts.
Preferably, in step c), the companded audio X_o is assigned the same label as the original audio X and added to the original dataset, together forming the new training set S_x.
Preferably, in step d), the audio in the training set S_x is subjected to framing, windowing and mel-scale transformation to extract audio frequency features, thereby obtaining the frequency f.
Preferably, in step e), fundamental frequency estimation is performed using the SWIPE algorithm. The significance of the peak of the amplitude spectrum at each integer multiple of a candidate frequency f, relative to its two immediately adjacent valleys, is measured by the peak-to-valley distance:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|)/2
where d_k(f) denotes the peak-to-valley distance, k = 1, 2, …, n denotes the harmonic multiple, and |X(kf)| denotes the peak of the amplitude spectrum at k times the frequency f;
the saliency is expressed as the average peak-to-valley distance over the harmonics:
S(f) = (1/n) Σ_{k=1..n} d_k(f)
The fundamental frequency finally estimated from this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for subsequent feature normalization.
Another technical solution of the present invention provides a hierarchical fusion audio data enhancement system, characterized by comprising:
a signal companding unit: used for performing time-domain signal companding on the audio X to obtain the companded audio X_o; a waveform-similarity overlap-add algorithm is adopted, which preserves the harmonic signal while introducing a series of distortions, yielding the companded audio X_o;
a training set construction unit: used for mixing the companded audio X_o with the original audio X to form a new training set S_x;
a frequency extraction unit: used for extracting the frequency of each audio in the training set S_x to obtain the frequency f;
a fundamental frequency extraction unit: used for extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
a fundamental frequency perturbation unit: used for adding perturbations to the fundamental frequency f_o,def to form the fundamental frequency set S_f; frequency offsets of ±20, ±40 and ±60 are added respectively, and the perturbed fundamental frequencies f_o, together with the original fundamental frequency f_o,def, form the set S_f;
a frequency normalization unit: used for normalizing the frequency with the fundamental frequency set S_f to construct the frequency set S_F; for the spectrogram of the current audio, the fundamental frequency of each frame is extracted, the median of these fundamental frequencies is recorded as f_o,audio, and each value in S_f is used to normalize the frequency values on the mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig denotes a frequency value on the mel scale, f_o,audio the median fundamental frequency over all frames of the current audio, and f_o,def the default fundamental frequency; the normalized values f_norm form the frequency set S_F;
an acoustic feature extraction unit: used for extracting acoustic features with the frequency set S_F; the elements of S_F serve as references, and the signal is converted by fast Fourier transform into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds.
The invention provides a method for enhancing data by hierarchical fusion. Time-scale modification (TSM) is a technique that changes the speed of audio without changing its pitch and is very important in current audio signal processing. A TSM algorithm divides a segment of an audio signal evenly into frames, applies processing such as stretching or compressing to each frame, and finally re-superimposes the frames into a composite signal. The present method realizes companding of the audio signal based on the waveform-similarity overlap-add algorithm and applies the companded signal to the construction of new audio, achieving data enhancement without using any additional audio. These audio signals are then used for feature extraction, during which feature-level data enhancement is performed: adding small perturbations to the sound frequency improves the robustness of the model against noise, and the fundamental-frequency-normalization-based method constructs new samples similar to the original audio while reducing the influence of audio differences caused by environmental noise on the recognition result.
The invention provides a hierarchical fusion audio data enhancement method that fuses existing audio data enhancement methods, divides the enhancement into different stages, can be used for various audio tasks, and therefore has a certain universality. The invention realizes audio data enhancement without using extra noise data, reducing the dependence of existing data enhancement methods on external noise data.
Drawings
FIG. 1 is a flow chart of a method of audio data enhancement of the present invention;
fig. 2 is a flow chart of the computation of audio feature data enhancement of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It is to be understood that these examples are included merely to promote an understanding of the invention and are not intended to limit its scope or application. Various alterations and modifications will occur to those skilled in the art upon reading this disclosure, and such equivalent variations likewise fall within the scope of the appended claims. This description covers preferred embodiments by way of example and need not be exhaustive.
As shown in fig. 1, the audio data enhancement method of the present invention comprises two stages: audio signal data enhancement and audio feature data enhancement. The former comprises signal companding and training set construction; the latter comprises frequency extraction, fundamental frequency perturbation, frequency normalization and acoustic feature extraction. The specific implementation process is as follows:
the first step is as follows: the raw signal is collected by a sensor and stored as audio X in the form of a digital signal.
The second step is that: performing time domain signal companding on the audio X to obtain the companded audio X o
A waveform-similarity overlap-add algorithm is adopted, which preserves the harmonic signal while introducing a series of distortions, yielding the companded audio X_o. With any audio frame in the original audio defined as the first audio frame, the specific operations of the second step are:
(1) selecting a second audio frame in the neighborhood of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
(2) searching for a third audio frame within the range [-Δmax, Δmax], with Δmax set to half an audio period; the cross-correlation coefficients between frames in this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
(3) splicing the first, second and third audio frames with the same step size and adding the overlapping parts.
The third step: the companded audio X_o is mixed with the original audio X to form a new training set S_x.
The companded audio X_o is assigned the same label as the original audio X and added to the original dataset, together forming the new training set S_x.
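As a sketch, this training-set construction reduces to label-preserving pooling. The data structures used here, lists of (signal, label) pairs, are an assumed representation for illustration, not the patent's.

```python
def build_training_set(originals, companded):
    # S_x: pool original and companded signals; each companded signal
    # inherits the label of the original it was derived from.
    # originals: list of (signal, label) pairs.
    # companded: list of companded signals, index-aligned with originals.
    s_x = list(originals)
    for (sig, label), comp in zip(originals, companded):
        s_x.append((comp, label))  # same label as the source audio
    return s_x
```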
The fourth step: the frequency of each audio in the training set S_x is extracted to obtain the frequency f.
The audio in the training set S_x is subjected to operations such as framing, windowing and mel-scale transformation to extract audio frequency features; the basic flow is shown in fig. 2, and the specific operations are as follows:
(1) Audio framing
Because the audio signal is non-stationary as a whole, frequency-domain conversion cannot be applied directly to the entire utterance; framing divides a segment of audio into many short segments, each of which can be regarded as stationary. After framing, each short segment can be treated as a stable signal. In this embodiment, 25 ms is used as the frame length; to avoid abrupt signal changes between adjacent frames, a new frame is taken every 10 ms, i.e., the frame shift is 10 ms.
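A minimal framing sketch with the embodiment's 25 ms frame length and 10 ms shift; the 16 kHz sampling rate is an assumption, since the patent does not state one.

```python
def frame_signal(x, sr=16000, frame_ms=25, hop_ms=10):
    # 25 ms frames taken every 10 ms -> adjacent frames overlap by 15 ms.
    flen = sr * frame_ms // 1000   # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000      # 160 samples at 16 kHz
    return [x[i:i + flen] for i in range(0, len(x) - flen + 1, hop)]
```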
(2) Windowing
A window function slides over the sliced frames so that the amplitude of each frame signal tapers gradually to zero at both ends, thereby emphasizing the middle portion of the frame; this improves the resolution of the subsequent Fourier transform. Because of the window function, the amplitudes at the two ends of each frame are attenuated, but since framing leaves adjacent frames overlapping, every sampling point still receives significant weight from some window.
In this embodiment, a Hamming window is used as the window function w(n):
w(n) = 0.54 - 0.46 cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where n denotes the position of the (n + 1)-th sampling point within a frame, and N denotes the Hamming window length, i.e., the total number of sampling points, 256 in this embodiment.
(3) Mel scale conversion
For the frequencies contained in the current frame, the mel-scale transformation is applied:
f = 2595 · log10(1 + f′/700)
where f is the transformed (mel) frequency and f′ is the frequency before transformation.
The fifth step: fundamental frequency extraction is performed on each audio.
This embodiment uses the SWIPE algorithm for fundamental frequency estimation. The significance of the peak of the amplitude spectrum at each integer multiple of a candidate frequency f, relative to its two immediately adjacent valleys, is measured by the peak-to-valley distance:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|)/2
where d_k(f) denotes the peak-to-valley distance, k = 1, 2, …, n denotes the harmonic multiple, and |X(kf)| denotes the peak of the amplitude spectrum at k times the frequency f.
The saliency is expressed as the average peak-to-valley distance over the harmonics:
S(f) = (1/n) Σ_{k=1..n} d_k(f)
The fundamental frequency finally estimated from this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for subsequent feature normalization.
The sixth step: perturbations are added to the fundamental frequency to form the fundamental frequency set S_f.
Perturbations are added to the fundamental frequency f_o,def by applying frequency offsets of ±20, ±40 and ±60 (offsets on the mel scale), giving the perturbed fundamental frequencies f_o; together with the original fundamental frequency f_o,def, these form the fundamental frequency set S_f.
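The set construction is straightforward; a sketch (the helper name is assumed):

```python
def build_f0_set(f0_def, offsets=(20.0, 40.0, 60.0)):
    # S_f = {f0_def, f0_def ± 20, f0_def ± 40, f0_def ± 60},
    # the offsets being mel-scale shifts per the embodiment.
    s_f = [f0_def]
    for d in offsets:
        s_f.extend([f0_def + d, f0_def - d])
    return s_f
```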
The seventh step: the frequencies are normalized using the fundamental frequency set to construct the frequency set S_F.
The fundamental frequency set S_f is used to normalize the frequency features. For the spectrogram of the current audio, the fundamental frequency of each frame is extracted by the method of the fifth step, and the median of these fundamental frequencies is recorded as f_o,audio; each value in the set S_f is then used to normalize the frequency values on the mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig denotes a frequency value on the mel scale, f_o,audio the median fundamental frequency over all frames of the current audio, and f_o,def the default fundamental frequency.
The normalized values f_norm form the frequency set S_F.
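The normalization step can be sketched as follows (helper names assumed): for each target fundamental drawn from S_f, every mel-scale frequency value of the current audio is shifted by the same constant, so each element of S_f yields one normalized variant of the features.

```python
def median(vals):
    # Median of the per-frame fundamental frequencies (f_o,audio).
    s = sorted(vals)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def normalize_frequencies(mel_freqs, frame_f0s, f0_target):
    # f_norm = f_orig - (f_o,audio - f_o,def): shift the mel-scale
    # frequencies so the audio's median fundamental maps onto the
    # (possibly perturbed) target fundamental drawn from S_f.
    f0_audio = median(frame_f0s)
    return [f - (f0_audio - f0_target) for f in mel_freqs]
```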
The eighth step: acoustic feature extraction is performed using the frequency set S_F.
Taking the elements of the frequency set as reference frequencies, a fast Fourier transform (FFT) converts the signal into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds.
The spectrum is computed by applying an N-point FFT to each frame signal, with N = 256. The short-time Fourier transform S(t, f) of audio X is expressed as:
S(t, f) = Σ_n x(n) · w(n - t) · e^(-j2πfn/N)
where x(n) denotes the audio signal at the n-th sampling position in the t-th frame and w(n - t) denotes the window function.
Then, according to the actual task, various acoustic features such as Fbank or MFCC are extracted and used as the input of the corresponding model.
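A sketch of the per-frame energy distribution via a direct N-point DFT; a real implementation would use an FFT routine, and the function name is assumed.

```python
import cmath, math

def frame_energy_spectrum(frame, N=256):
    # Squared magnitude of the N-point DFT of one (windowed) frame:
    # the energy distribution over frequency bins.
    frame = (list(frame) + [0.0] * N)[:N]  # zero-pad / truncate to N
    spec = []
    for k in range(N):
        s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
        spec.append(abs(s) ** 2)
    return spec
```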
The method provided by the invention needs no extra noise data: it operates only on the input audio signal and requires no label transformation. Compared with existing task-specific data enhancement methods it has wider applicability, and it can be used in common tasks such as audio classification, voiceprint recognition and speech recognition.
Claims (6)
1. A hierarchical fusion audio data enhancement method, characterized by comprising the following steps:
a) collecting an original signal and storing it as audio X in the form of a digital signal;
b) performing time-domain signal companding on the audio X to obtain the companded audio X_o: a waveform-similarity overlap-add algorithm is adopted, which preserves the harmonic signal while introducing a series of distortions, yielding the companded audio X_o;
c) mixing the companded audio X_o with the original audio X to form a new training set S_x;
d) extracting the frequency of each audio in the training set S_x to obtain the frequency f;
e) extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
f) adding perturbations to the fundamental frequency f_o,def to form the fundamental frequency set S_f: frequency offsets of ±20, ±40 and ±60 are added respectively, and the perturbed fundamental frequencies f_o, together with the original fundamental frequency f_o,def, form the fundamental frequency set S_f;
g) normalizing the frequency using the fundamental frequency set S_f to construct the frequency set S_F: for the spectrogram of the current audio, the fundamental frequency of each frame is extracted as in step e), and the median of these fundamental frequencies is recorded as f_o,audio; each value in S_f is then used to normalize the frequency values on the mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig denotes a frequency value on the mel scale, f_o,audio the median fundamental frequency over all frames of the current audio, and f_o,def the default fundamental frequency; the normalized values f_norm form the frequency set S_F;
h) extracting acoustic features using the frequency set S_F: taking the elements of S_F as references, the signal is converted by fast Fourier transform into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds.
2. The hierarchical fusion audio data enhancement method of claim 1, characterized in that, with any audio frame in the original audio X defined as the first audio frame, step b) specifically comprises the following steps:
selecting a second audio frame in the neighborhood of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
searching for a third audio frame within the range [-Δmax, Δmax], with Δmax set to half an audio period; the cross-correlation coefficients between frames in this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
splicing the first, second and third audio frames with the same step size and adding the overlapping parts.
3. The hierarchical fusion audio data enhancement method of claim 1, characterized in that, in step c), the companded audio X_o is assigned the same label as the original audio X and added to the original dataset, together forming the new training set S_x.
4. The hierarchical fusion audio data enhancement method of claim 1, characterized in that, in step d), the audio in the training set S_x is subjected to framing, windowing and mel-scale transformation to extract audio frequency features, thereby obtaining the frequency f.
5. The hierarchical fusion audio data enhancement method of claim 1, characterized in that, in step e), fundamental frequency estimation is performed using the SWIPE algorithm, and the significance of the peak of the amplitude spectrum at each integer multiple of a candidate frequency f, relative to its two immediately adjacent valleys, is measured by the peak-to-valley distance:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|)/2
where d_k(f) denotes the peak-to-valley distance, k = 1, 2, …, n denotes the harmonic multiple, and |X(kf)| denotes the peak of the amplitude spectrum at k times the frequency f;
the saliency is expressed as the average peak-to-valley distance over the harmonics:
S(f) = (1/n) Σ_{k=1..n} d_k(f)
The fundamental frequency finally estimated from this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for subsequent feature normalization.
6. A hierarchical fused audio data enhancement system, comprising:
a signal companding unit: is used for carrying out time domain signal companding on the audio X to obtain the audio X after companding o Wherein, when the audio X is subjected to time domain signal companding, the waveform similarity overlapping superposition calculation is adoptedMethod for obtaining companded audio X by introducing a series of distortions while preserving harmonic signals o ;
A training set construction unit: for compressing the expanded audio X o Mixed with the original audio X to form a new training set S x ;
A frequency extraction unit: for extracting training set S x The frequency of each audio in the audio signal is obtained to obtain a frequency f;
a fundamental frequency extraction unit: for extracting fundamental frequency of each audio to obtain fundamental frequency f o,def ;
A fundamental frequency disturbance adding unit: for the fundamental frequency f o,def Adding disturbance to form a fundamental frequency set S f Wherein for the fundamental frequency f o,def When adding disturbance, respectively adding frequency offsets of +/-20, +/-40 and +/-60 to obtain disturbed fundamental frequency f o With the original fundamental frequency f o,def Jointly form a fundamental frequency set S f ;
a frequency normalization unit: used for normalizing the frequency with the fundamental frequency set S_f to construct a frequency set S_F, wherein, when the frequency feature is normalized with the fundamental frequency set S_f, for the spectrogram corresponding to the current audio, the fundamental frequency of each frame is extracted as in step d), and the median of these fundamental frequencies is recorded as f_o,audio; each value in the fundamental frequency set S_f is then used in turn to normalize the frequency value on the mel scale:

f_norm = f_orig - (f_o,audio - f_o,def)

where f_orig represents the frequency value on the mel scale, f_o,audio represents the median of the fundamental frequencies of all frames in the current audio, and f_o,def represents the default fundamental frequency; the f_norm values obtained by normalization form the frequency set S_F;
an acoustic feature extraction unit: used for extracting acoustic features with the frequency set S_F, wherein, when acoustic features are extracted with the frequency set S_F, each element of the frequency set S_F is taken as a reference to perform a fast Fourier transform on the signal, converting it into an energy distribution in the frequency domain, where different energy distributions represent different properties of the speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210458199.7A CN114937459A (en) | 2022-04-28 | 2022-04-28 | Hierarchical fusion audio data enhancement method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114937459A true CN114937459A (en) | 2022-08-23 |
Family
ID=82863224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210458199.7A Pending CN114937459A (en) | 2022-04-28 | 2022-04-28 | Hierarchical fusion audio data enhancement method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114937459A (en) |
2022-04-28: application CN202210458199.7A filed in China (CN); publication CN114937459A; status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109147796B (en) | Speech recognition method, device, computer equipment and computer readable storage medium | |
CN111816218A (en) | Voice endpoint detection method, device, equipment and storage medium | |
CN103559882B (en) | A kind of meeting presider's voice extraction method based on speaker's segmentation | |
CN109192200B (en) | Speech recognition method | |
CN101226743A (en) | Method for recognizing speaker based on conversion of neutral and affection sound-groove model | |
US8566084B2 (en) | Speech processing based on time series of maximum values of cross-power spectrum phase between two consecutive speech frames | |
CN108305639B (en) | Speech emotion recognition method, computer-readable storage medium and terminal | |
CN103117059A (en) | Voice signal characteristics extracting method based on tensor decomposition | |
CN101221762A (en) | MP3 compression field audio partitioning method | |
CN108682432B (en) | Speech emotion recognition device | |
CN110647656B (en) | Audio retrieval method utilizing transform domain sparsification and compression dimension reduction | |
CN112071308A (en) | Awakening word training method based on speech synthesis data enhancement | |
CN110570870A (en) | Text-independent voiceprint recognition method, device and equipment | |
CN107103913B (en) | Speech recognition method based on power spectrum Gabor characteristic sequence recursion model | |
CN112786054A (en) | Intelligent interview evaluation method, device and equipment based on voice and storage medium | |
CN116884431A (en) | CFCC (computational fluid dynamics) feature-based robust audio copy-paste tamper detection method and device | |
Akdeniz et al. | Linear prediction coefficients based copy-move forgery detection in audio signal | |
CN114937459A (en) | Hierarchical fusion audio data enhancement method and system | |
CN111402898B (en) | Audio signal processing method, device, equipment and storage medium | |
Patil et al. | Content-based audio classification and retrieval: A novel approach | |
Aurchana et al. | Musical instruments sound classification using GMM | |
CN112309404A (en) | Machine voice identification method, device, equipment and storage medium | |
Yue et al. | Speaker age recognition based on isolated words by using SVM | |
Therese et al. | A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system | |
CN110634473A (en) | Voice digital recognition method based on MFCC |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||