CN114937459A - Hierarchical fusion audio data enhancement method and system


Info

Publication number
CN114937459A
Authority
CN (China)
Prior art keywords
audio, frequency, fundamental frequency
Prior art date
2022-04-28
Legal status
Pending
Application number
CN202210458199.7A
Other languages
Chinese (zh)
Inventor
武星
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Filing date
2022-04-28
Publication date
2022-08-23
Application filed by University of Shanghai for Science and Technology

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a hierarchical fusion audio data enhancement method and system, wherein the method comprises the following steps: collecting an original audio signal X; performing time-domain signal companding on the audio signal X with the WSOLA algorithm to obtain the companded audio X_o; mixing the companded audio X_o with the original audio X to form a new training set S_x; extracting the frequency of each audio in the training set; estimating the fundamental frequency with the SWIPE algorithm and constructing a fundamental frequency set S_f; normalizing the frequency with the perturbed fundamental frequencies to construct a frequency set S_F; and extracting acoustic features with a fast Fourier transform. The audio data enhancement method provided by the invention improves the noise robustness of the model and can be adapted to a variety of speech tasks, including but not limited to audio classification, voiceprint recognition, and speech recognition.

Description

Hierarchical fusion audio data enhancement method and system
Technical Field
The invention relates to an audio data enhancement method applicable to various audio tasks, and belongs to the field of audio data processing.
Background
Currently, most audio tasks depend on the amount of labeled data: the larger the data set, the better the model. For audio tasks under low-resource conditions, data enhancement is a simple and effective way to construct new samples. With data enhancement, a model can extract stable speech representations even from small samples, and the recognition effect is greatly improved compared with the original training method.
Existing research focuses on data enhancement at both the front end and the feature level to improve recognition performance in unknown environments. Adding parameterized reverberation, offset, and speed perturbation to the original audio can simulate noise in real environments, and enhancing data in this way can greatly improve the robustness of the model. Augmenting data with vocal tract length perturbation has also proven effective. In addition, a data enhancement method based on signal compression, which compresses and expands the original signal with the A-law and μ-law companding methods, has been used successfully in the field of sound attack detection.
Beyond front-end enhancement, data enhancement can also be performed on audio features. Existing research achieves feature-level enhancement based on fundamental frequency normalization, constructing multiple similar frequencies by adding different degrees of perturbation to the fundamental frequency. This approach achieves good results on speech recognition tasks.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: for the audio classification task, on the one hand, acquiring a large amount of external noise data to realize data enhancement is costly; on the other hand, the samples constructed by existing methods are relatively limited.
In order to solve the above technical problem, one aspect of the present invention provides a hierarchical fusion audio data enhancement method, comprising the following steps:
a) collecting an original signal and storing it in digital form as audio X;
b) performing time-domain signal companding on the audio X to obtain the companded audio X_o;
when performing time-domain signal companding on the audio X, a waveform similarity overlap-add (WSOLA) algorithm is adopted, which introduces a series of distortions while preserving the harmonic signal, yielding the companded audio X_o;
c) mixing the companded audio X_o with the original audio X to form a new training set S_x;
d) extracting the frequency of each audio in the training set S_x to obtain the frequency f;
e) extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
f) adding perturbation to the fundamental frequency f_o,def to form a fundamental frequency set S_f;
when adding perturbation to the fundamental frequency f_o,def, frequency offsets of ±20, ±40, and ±60 are added respectively, and the perturbed fundamental frequencies f_o together with the original fundamental frequency f_o,def form the fundamental frequency set S_f;
g) normalizing the frequency using the fundamental frequency set S_f to construct a frequency set S_F;
when normalizing the frequency features with the fundamental frequency set S_f, the fundamental frequency of each frame of the spectrogram corresponding to the current audio is extracted using step e), and the median of these fundamental frequencies is recorded as f_o,audio; each value in the fundamental frequency set S_f is then used to normalize the frequency values on the Mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig represents a frequency value on the Mel scale, f_o,audio represents the median of the fundamental frequencies of all frames in the current audio, and f_o,def represents the default fundamental frequency;
the f_norm values obtained by normalization form the frequency set S_F;
h) performing acoustic feature extraction using the frequency set S_F:
taking the elements in the frequency set S_F as references, a fast Fourier transform converts the audio signal into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech.
Preferably, with any audio frame in the original audio X defined as a first audio frame, step b) specifically comprises the following steps:
selecting a second audio frame within a range to the left and right of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
searching for a third audio frame within the range [-Δmax, Δmax], where Δmax is set to half an audio period; the cross-correlation coefficients between frames within this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
splicing the first, second, and third audio frames with the same step size and summing the overlapping parts.
Preferably, in step c), the companded audio X_o is given the same label as the original audio X and added to the original data set, together forming the new training set S_x.
Preferably, in step d), the audio in the training set S_x is subjected to framing, windowing, and Mel scale transformation to extract audio frequency features and thereby obtain the frequency f.
Preferably, in step e), fundamental frequency estimation is performed using the SWIPE algorithm: the significance of each peak of the amplitude spectrum at an integer multiple of a candidate frequency f, relative to the two valleys immediately adjacent to it, is measured by the peak-to-valley distance, defined as:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|) / 2
where d_k(f) represents the peak-to-valley distance, k = 1, 2, ..., n represents the multiple, and |X(kf)| represents the peak of the amplitude spectrum at k times the frequency f;
the significance is represented by the average peak-to-valley distance over the harmonics, as shown in the following formula:
D(f) = (1/n) Σ_{k=1}^{n} d_k(f)
The fundamental frequency finally estimated on the basis of this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for the subsequent feature normalization.
Another technical solution of the present invention is a hierarchical fusion audio data enhancement system, characterized by comprising:
a signal companding unit, used for performing time-domain signal companding on the audio X to obtain the companded audio X_o, wherein a waveform similarity overlap-add (WSOLA) algorithm is adopted, introducing a series of distortions while preserving the harmonic signal;
a training set construction unit, used for mixing the companded audio X_o with the original audio X to form a new training set S_x;
a frequency extraction unit, used for extracting the frequency of each audio in the training set S_x to obtain the frequency f;
a fundamental frequency extraction unit, used for extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
a fundamental frequency perturbation unit, used for adding perturbation to the fundamental frequency f_o,def to form a fundamental frequency set S_f, wherein frequency offsets of ±20, ±40, and ±60 are added respectively, and the perturbed fundamental frequencies f_o together with the original fundamental frequency f_o,def form the fundamental frequency set S_f;
a frequency normalization unit, used for normalizing the frequency with the fundamental frequency set S_f to construct a frequency set S_F, wherein the fundamental frequency of each frame of the spectrogram corresponding to the current audio is extracted by the fundamental frequency extraction unit, the median of these fundamental frequencies is recorded as f_o,audio, and each value in the fundamental frequency set S_f is used to normalize the frequency values on the Mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig represents a frequency value on the Mel scale, f_o,audio represents the median of the fundamental frequencies of all frames in the current audio, and f_o,def represents the default fundamental frequency; the f_norm values obtained by normalization form the frequency set S_F;
an acoustic feature extraction unit, used for extracting acoustic features with the frequency set S_F, wherein, taking the elements in the frequency set S_F as references, a fast Fourier transform converts the signal into an energy distribution in the frequency domain, and different energy distributions represent the characteristics of different speech.
The invention provides a hierarchically fused approach to data enhancement. Time-scale modification (TSM) is a technique that changes the speed of audio without changing its pitch and is very important in current audio signal processing. A TSM algorithm divides a segment of an audio signal into frames, applies processing such as stretching or compression to each frame, and finally overlap-adds the frames into a synthesized signal. The method realizes companding of the audio signal based on the waveform similarity overlap-add algorithm, applies this to data enhancement, and uses the processed signal to construct new audio, achieving data enhancement without any additional audio. These audio signals are used for subsequent feature extraction, during which feature-level data enhancement is performed: adding tiny perturbations to the sound frequency improves the robustness of the model against noise, and the fundamental-frequency-normalization-based method constructs new samples similar to the original audio while reducing the influence of audio differences caused by environmental noise on the recognition result.
The invention provides a hierarchical fusion audio data enhancement method that fuses existing audio data enhancement methods and divides the enhancement into different stages; it can be used for various audio tasks and has a certain universality. The invention realizes audio data enhancement without extra noise data, reducing the dependence of existing data enhancement methods on external noise data.
Drawings
FIG. 1 is a flow chart of a method of audio data enhancement of the present invention;
FIG. 2 is a flow chart of the computation of audio feature data enhancement of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It is to be understood that these examples are included merely to promote an understanding of the invention and are not intended to limit its scope. Various alterations and modifications will occur to those skilled in the art upon reading the teachings herein, and such equivalent variations likewise fall within the scope of the appended claims. This description covers preferred embodiments by way of example and is not exhaustive of all embodiments.
As shown in FIG. 1, the audio data enhancement method of the present invention comprises two stages: audio signal data enhancement and audio feature data enhancement. The audio signal data enhancement comprises signal companding and training set construction; the audio feature data enhancement comprises frequency extraction, fundamental frequency extraction, fundamental frequency perturbation, frequency normalization, and acoustic feature extraction. The specific implementation process is as follows:
the first step is as follows: the raw signal is collected by a sensor and stored as audio X in the form of a digital signal.
The second step: time-domain signal companding is performed on the audio X to obtain the companded audio X_o.
A waveform similarity overlap-add (WSOLA) algorithm is adopted, which introduces a series of distortions while preserving the harmonic signal, yielding the companded audio X_o. With any audio frame in the original audio defined as a first audio frame, the specific operations of the second step are:
(1) selecting a second audio frame within a range to the left and right of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
(2) searching for a third audio frame within the range [-Δmax, Δmax], where Δmax is set to half an audio period; the cross-correlation coefficients between frames within this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
(3) splicing the first, second, and third audio frames with the same step size and summing the overlapping parts.
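As an illustration of these three operations, the following minimal WSOLA sketch time-scales a mono signal in Python; the frame length, hop size, search radius delta_max, and the coarse search grid are assumptions of this sketch rather than values prescribed by the invention:

    import numpy as np

    def wsola(x, rate, frame_len=1024, hop=256, delta_max=256):
        """Minimal WSOLA sketch: rate > 1 compresses, rate < 1 expands;
        pitch is preserved because frames are copied, not resampled."""
        win = np.hanning(frame_len)
        n_out = int(len(x) / rate)
        y = np.zeros(n_out + frame_len)
        wsum = np.zeros(n_out + frame_len)     # window-energy normalizer
        prev = 0                               # analysis index of last frame
        for out in range(0, n_out, hop):
            nominal = int(out * rate)          # nominal analysis position
            # natural continuation of the previously copied frame
            target = x[prev + hop : prev + hop + frame_len]
            lo = max(0, nominal - delta_max)
            hi = min(len(x) - frame_len, nominal + delta_max)
            if hi <= lo or len(target) < frame_len:
                break
            # pick the candidate frame with maximal cross-correlation
            corrs = [np.dot(x[c : c + frame_len], target)
                     for c in range(lo, hi + 1, 16)]
            best = lo + 16 * int(np.argmax(corrs))
            y[out : out + frame_len] += win * x[best : best + frame_len]
            wsum[out : out + frame_len] += win
            prev = best
        wsum[wsum == 0] = 1.0
        return (y / wsum)[:n_out]

For example, wsola(x, 1.1) would yield a compressed copy and wsola(x, 0.9) an expanded copy, both keeping the original pitch and carrying the small waveform distortions the method exploits.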
The third step: the companded audio X_o is mixed with the original audio X to form a new training set S_x.
The companded audio X_o is given the same label as the original audio X and added to the original data set, together forming the new training set S_x.
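A sketch of how the companded copies might be merged with the originals, reusing the hypothetical wsola function above; the rate values 0.9 and 1.1 are illustrative assumptions:

    def build_training_set(dataset):
        """dataset: iterable of (waveform, label) pairs; returns S_x."""
        s_x = []
        for x, label in dataset:
            s_x.append((x, label))                    # original audio X
            for rate in (0.9, 1.1):                   # assumed companding rates
                s_x.append((wsola(x, rate), label))   # X_o keeps the same label
        return s_x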
The fourth step: the frequency of each audio in the training set S_x is extracted to obtain the frequency f.
The audio in the training set S_x is subjected to framing, windowing, Mel scale transformation, and similar operations to extract audio frequency features; the basic flow is shown in FIG. 2, and the specific operations are as follows:
(1) Audio framing
Because the audio signal is non-stationary as a whole, frequency-domain transformation cannot be applied directly to the entire utterance; framing divides a segment of audio into many short segments, each of which is regarded as stationary. After framing, the speech can be treated as a stable signal. In this embodiment, 25 ms is used as the frame length; to avoid abrupt signal changes between adjacent frames, a frame is taken every 10 ms, i.e., the frame shift is 10 ms.
(2) Windowing
A window function slides over the framed signal so that the amplitude of each frame tapers gradually to zero at both ends, highlighting the middle portion of the frame and improving the resolution of the subsequent Fourier transform. Because the window function attenuates the two ends of each frame, the framing makes adjacent frames overlap, so that every sample point falls under the prominent part of some window.
In this embodiment, a Hamming window is used as the window function w(n), as shown in the following formula:
w(n) = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where n is the position of the (n+1)-th sample point within one frame of the signal and N is the Hamming window length, i.e., the total number of sample points, which is 256 in this embodiment.
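Steps (1) and (2) might be sketched as follows; the sample rate argument and the stacking of frames into a matrix are assumptions of this illustration:

    import numpy as np

    def frame_signal(x, sr, frame_ms=25.0, hop_ms=10.0):
        """25 ms frames with a 10 ms shift, each weighted by a Hamming
        window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
        frame_len = int(sr * frame_ms / 1000)    # N, the window length
        hop = int(sr * hop_ms / 1000)            # frame shift
        starts = range(0, len(x) - frame_len + 1, hop)
        frames = np.stack([x[s : s + frame_len] for s in starts])
        return frames * np.hamming(frame_len)    # taper both ends of each frame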
(3) Mel scale conversion
For the frequencies contained in the current frame, the transform is performed using the Mel scale, as shown in the following equation:
f = 2595 log10(1 + f'/700)
where f is the transformed frequency and f' is the frequency before transformation.
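A direct transcription of this transform, together with its inverse (the inverse function is an assumption of this sketch, added because normalized Mel values must eventually be mapped back to linear frequency):

    import numpy as np

    def hz_to_mel(f_prime):
        """Mel scale transform: f = 2595 * log10(1 + f'/700)."""
        return 2595.0 * np.log10(1.0 + np.asarray(f_prime) / 700.0)

    def mel_to_hz(f):
        """Inverse of the transform above."""
        return 700.0 * (10.0 ** (np.asarray(f) / 2595.0) - 1.0)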
The fifth step: fundamental frequency extraction is performed on each audio.
This embodiment uses the SWIPE algorithm for fundamental frequency estimation. The significance of each peak of the amplitude spectrum at an integer multiple of a candidate frequency f, relative to the two valleys immediately adjacent to it, is measured by the peak-to-valley distance, defined as:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|) / 2
where d_k(f) represents the peak-to-valley distance, k = 1, 2, ..., n represents the multiple, and |X(kf)| represents the peak of the amplitude spectrum at k times the frequency f;
the significance is represented by the average peak-to-valley distance over the harmonics, as shown in the following formula:
D(f) = (1/n) Σ_{k=1}^{n} d_k(f)
The fundamental frequency finally estimated on the basis of this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for the subsequent feature normalization.
The sixth step: perturbation is added to the fundamental frequency to form the fundamental frequency set S_f.
Perturbation is added to the fundamental frequency f_o,def by applying frequency offsets of ±20, ±40, and ±60, i.e., Mel scale offsets, respectively; the perturbed fundamental frequencies f_o together with the original fundamental frequency f_o,def form the fundamental frequency set S_f.
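Building S_f then takes only a few lines; mapping the estimated fundamental to the Mel scale first, and the example value of 180 Hz, are assumptions of this sketch, since the offsets are Mel scale offsets:

    f_o_def = float(hz_to_mel(180.0))      # illustrative default fundamental (Mel)
    offsets = [0, +20, -20, +40, -40, +60, -60]
    S_f = [f_o_def + d for d in offsets]   # perturbed f_o plus the original f_o,def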
The seventh step: the frequency is normalized with the fundamental frequency set to construct the frequency set S_F.
The frequency features are normalized with the fundamental frequency set S_f: for the spectrogram corresponding to the current audio, the fundamental frequency of each frame is extracted by the method of the fifth step, and the median of these fundamental frequencies is recorded as f_o,audio; each value in the fundamental frequency set S_f is then used to normalize the frequency values on the Mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig represents a frequency value on the Mel scale, f_o,audio represents the median of the fundamental frequencies of all frames in the current audio, and f_o,def represents the default fundamental frequency;
the f_norm values obtained by normalization form the frequency set S_F.
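As a sketch, one normalized copy of the Mel frequency axis is produced per element of S_f; all quantities are assumed to be on the Mel scale already:

    import numpy as np

    def build_S_F(f_orig_mel, f_o_audio, S_f):
        """Shift the Mel frequency values so the audio's median fundamental
        f_o,audio lines up with each (perturbed) default fundamental in S_f."""
        f_orig_mel = np.asarray(f_orig_mel)
        return [f_orig_mel - (f_o_audio - f_o_def) for f_o_def in S_f]

    # Each returned array is one f_norm variant; together they form S_F.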
The eighth step: acoustic feature extraction is performed using the frequency set S_F.
Taking the elements in the frequency set S_F as reference frequencies, a fast Fourier transform (FFT) converts the signal into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech.
The spectrum is calculated by performing an N-point FFT on each frame of the signal, where N is 256. The fast Fourier transform S(t, f) of the audio X is represented as:
S(t, f) = Σ_{n=0}^{N-1} x(n) w(n - t) e^{-j2πfn/N}
where x(n) denotes the audio signal at the n-th sample position in the t-th frame and w(n - t) denotes the window function.
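A sketch of this step on the frame matrix from the framing sketch above; letting numpy's rfft crop or zero-pad each frame to the N = 256 FFT points is an assumption of this illustration:

    import numpy as np

    def fft_energy(frames, n_fft=256):
        """N-point FFT of each windowed frame (N = 256); the squared
        magnitude is the energy distribution over the frequency domain."""
        spec = np.fft.rfft(frames, n=n_fft, axis=1)   # crops or pads to n_fft
        return np.abs(spec) ** 2

    # frames = frame_signal(x, sr); energy = fft_energy(frames)
    # The energy feeds the Fbank / MFCC extraction described below.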
Various acoustic features, such as Fbank or MFCC, are then extracted according to the actual task and used as the input of the corresponding model.
The method provided by the invention requires no extra noise data and no label transformation; it operates only on the input audio signal. Compared with existing task-specific data enhancement methods, it has wider applicability and can be used in common tasks such as audio classification, voiceprint recognition, and speech recognition.

Claims (6)

1. A hierarchical fusion audio data enhancement method, characterized by comprising the following steps:
a) collecting an original signal and storing it in digital form as audio X;
b) performing time-domain signal companding on the audio X to obtain the companded audio X_o;
wherein, when performing time-domain signal companding on the audio X, a waveform similarity overlap-add (WSOLA) algorithm is adopted, which introduces a series of distortions while preserving the harmonic signal, yielding the companded audio X_o;
c) mixing the companded audio X_o with the original audio X to form a new training set S_x;
d) extracting the frequency of each audio in the training set S_x to obtain the frequency f;
e) extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
f) adding perturbation to the fundamental frequency f_o,def to form a fundamental frequency set S_f;
wherein frequency offsets of ±20, ±40, and ±60 are added respectively, and the perturbed fundamental frequencies f_o together with the original fundamental frequency f_o,def form the fundamental frequency set S_f;
g) normalizing the frequency using the fundamental frequency set S_f to construct a frequency set S_F;
wherein, when normalizing the frequency features with the fundamental frequency set S_f, the fundamental frequency of each frame of the spectrogram corresponding to the current audio is extracted using step e), the median of these fundamental frequencies is recorded as f_o,audio, and each value in the fundamental frequency set S_f is used to normalize the frequency values on the Mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig represents a frequency value on the Mel scale, f_o,audio represents the median of the fundamental frequencies of all frames in the current audio, and f_o,def represents the default fundamental frequency;
the f_norm values obtained by normalization form the frequency set S_F;
h) performing acoustic feature extraction using the frequency set S_F;
wherein, taking the elements in the frequency set S_F as references, a fast Fourier transform converts the signal into an energy distribution in the frequency domain, and different energy distributions represent the characteristics of different speech.
2. The hierarchical fusion audio data enhancement method according to claim 1, characterized in that, with any audio frame in the original audio X defined as a first audio frame, step b) specifically comprises the following steps:
selecting a second audio frame within a range to the left and right of the first audio frame, the phase of the second audio frame being aligned with that of the first audio frame;
searching for a third audio frame within the range [-Δmax, Δmax], where Δmax is set to half an audio period; the cross-correlation coefficients between frames within this range are computed, and the third audio frame with the highest similarity to the second audio frame is selected;
splicing the first, second, and third audio frames with the same step size and summing the overlapping parts.
3. The hierarchical fusion audio data enhancement method according to claim 1, characterized in that, in step c), the companded audio X_o is given the same label as the original audio X and added to the original data set, together forming the new training set S_x.
4. The hierarchical fusion audio data enhancement method according to claim 1, characterized in that, in step d), the audio in the training set S_x is subjected to framing, windowing, and Mel scale transformation to extract audio frequency features and thereby obtain the frequency f.
5. The hierarchical fusion audio data enhancement method according to claim 1, characterized in that, in step e), fundamental frequency estimation is performed using the SWIPE algorithm: the significance of each peak of the amplitude spectrum at an integer multiple of a candidate frequency f, relative to the two valleys immediately adjacent to it, is measured by the peak-to-valley distance, defined as:
d_k(f) = |X(kf)| - (|X((k - 1/2)f)| + |X((k + 1/2)f)|) / 2
where d_k(f) represents the peak-to-valley distance, k = 1, 2, ..., n represents the multiple, and |X(kf)| represents the peak of the amplitude spectrum at k times the frequency f;
the significance is represented by the average peak-to-valley distance over the harmonics, as shown in the following formula:
D(f) = (1/n) Σ_{k=1}^{n} d_k(f)
The fundamental frequency finally estimated on the basis of this saliency is denoted f_o,def, and this fundamental frequency f_o,def is used for the subsequent feature normalization.
6. A hierarchical fusion audio data enhancement system, characterized by comprising:
a signal companding unit, used for performing time-domain signal companding on the audio X to obtain the companded audio X_o, wherein a waveform similarity overlap-add (WSOLA) algorithm is adopted, introducing a series of distortions while preserving the harmonic signal;
a training set construction unit, used for mixing the companded audio X_o with the original audio X to form a new training set S_x;
a frequency extraction unit, used for extracting the frequency of each audio in the training set S_x to obtain the frequency f;
a fundamental frequency extraction unit, used for extracting the fundamental frequency of each audio to obtain the fundamental frequency f_o,def;
a fundamental frequency perturbation unit, used for adding perturbation to the fundamental frequency f_o,def to form a fundamental frequency set S_f, wherein frequency offsets of ±20, ±40, and ±60 are added respectively, and the perturbed fundamental frequencies f_o together with the original fundamental frequency f_o,def form the fundamental frequency set S_f;
a frequency normalization unit, used for normalizing the frequency with the fundamental frequency set S_f to construct a frequency set S_F, wherein the fundamental frequency of each frame of the spectrogram corresponding to the current audio is extracted by the fundamental frequency extraction unit, the median of these fundamental frequencies is recorded as f_o,audio, and each value in the fundamental frequency set S_f is used to normalize the frequency values on the Mel scale:
f_norm = f_orig - (f_o,audio - f_o,def)
where f_orig represents a frequency value on the Mel scale, f_o,audio represents the median of the fundamental frequencies of all frames in the current audio, and f_o,def represents the default fundamental frequency; the f_norm values obtained by normalization form the frequency set S_F;
an acoustic feature extraction unit, used for extracting acoustic features with the frequency set S_F, wherein, taking the elements in the frequency set S_F as references, a fast Fourier transform converts the signal into an energy distribution in the frequency domain, and different energy distributions represent the characteristics of different speech.
CN202210458199.7A 2022-04-28 2022-04-28 Hierarchical fusion audio data enhancement method and system Pending CN114937459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210458199.7A CN114937459A (en) 2022-04-28 2022-04-28 Hierarchical fusion audio data enhancement method and system

Publications (1)

Publication Number Publication Date
CN114937459A 2022-08-23

Family

ID=82863224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210458199.7A Pending CN114937459A (en) 2022-04-28 2022-04-28 Hierarchical fusion audio data enhancement method and system

Country Status (1)

Country Link
CN (1) CN114937459A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination