CN107657962A - Method and system for identifying and separating throat sound and gas sound of a voice signal - Google Patents

Method and system for identifying and separating throat sound and gas sound of a voice signal

Info

Publication number
CN107657962A
Authority
CN
China
Prior art keywords
amplitude spectrum
amplitude
fourier transform
window
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710692892.XA
Other languages
Chinese (zh)
Other versions
CN107657962B (en)
Inventor
何庆祥
张巍
霍颖翔
冯镇业
滕少华
张子臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201710692892.XA priority Critical patent/CN107657962B/en
Publication of CN107657962A publication Critical patent/CN107657962A/en
Application granted granted Critical
Publication of CN107657962B publication Critical patent/CN107657962B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Prostheses (AREA)

Abstract

The invention discloses a method and a system for identifying and separating the throat sound and gas sound of a voice signal. The method comprises: obtaining a unimodal test mask function and a reduction mask function; performing window-overlapping fast Fourier transform (FFT) on the input voice signal to obtain the Fourier-transformed amplitude spectrum; measuring the fundamental frequency of the Fourier-transformed amplitude spectrum of each window signal; calculating the amplitude peaks of the Fourier-transformed amplitude spectrum at positive integer multiples of the fundamental frequency according to the unimodal test mask function and the measured fundamental frequency; obtaining the envelope of the throat sound component from the calculated amplitude peaks; calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peaks; calculating the amplitude spectrum of the gas sound component from the Fourier-transformed amplitude spectrum and the amplitude spectrum of the throat sound component; and obtaining the envelope of the gas sound component from the amplitude spectrum of the gas sound component. By introducing the unimodal test mask function and the reduction mask function, the invention can identify and separate the throat sound and gas sound in a voice signal, and can be widely applied in the field of signal processing.

Description

Method and system for identifying and separating throat sound and gas sound of voice signal
Technical Field
The invention relates to the field of signal processing, in particular to a method and a system for identifying and separating throat sound and gas sound of a voice signal.
Background
Voice is the acoustic expression of language. It is the most natural, effective and convenient means for humans to exchange information, and it also supports human thinking. As humanity enters the information age, studying voice processing technology with modern means allows people to generate, transmit, store and acquire voice information more effectively, which is of great significance for social development. In recent decades, the study of speech has brought scientists and engineers together and formed an important discipline: speech signal processing. Speech signal processing, called speech processing for short, is closely connected with linguistics, phonetics, psychology, acoustics, computer science, artificial intelligence and other disciplines, and has greatly promoted scientific and technological progress. With automatic speech recognition, handwriting manuscripts and manually typing text can be replaced by an automatic dictation machine; manually looking up written material can be replaced by querying databases through spoken requests; and with speech synthesis, stored speech or text data can be converted into high-quality speech for playback, or even automatically translated into speech in another language for playback or text display. In summary, research on speech signal processing technology is of great importance to the development of the information society.
Since the vibration sound (abbreviated as throat sound) emitted from the throat part and the gas sound (abbreviated as gas sound) emitted from the lip and tooth part in the voice have obviously different characteristics, the throat sound and the gas sound need to be separately processed when subsequent operations such as audio compression are carried out. However, throat sounds and air sounds cannot be separated from voice signals at present, and how to identify and separate the throat sounds and the air sounds in the voice signals becomes a technical problem to be solved urgently in the industry.
Disclosure of Invention
To solve the above technical problems, the present invention aims to: a method for recognizing and separating throat sound and gas sound of a voice signal is provided.
Another object of the present invention is to: a system for identifying and separating throat sound and gas sound of a voice signal is provided.
The technical scheme adopted by the invention is as follows:
a method for identifying and separating throat sound and gas sound of a voice signal comprises the following steps:
obtaining a unimodal test mask function and a reduction mask function;
carrying out window-overlapping fast Fourier transform on an input voice signal to obtain an amplitude spectrum after Fourier transform;
measuring the fundamental frequency of the amplitude spectrum of each window signal after Fourier transform;
calculating the amplitude peak value of the amplitude spectrum after Fourier transform at the positive integral multiple fundamental frequency according to the unimodal test mask function and the measured fundamental frequency;
obtaining the envelope of the throat sound component according to the calculated amplitude peak value;
calculating an amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peak value;
calculating the amplitude spectrum of the gas sound component according to the amplitude spectrum after Fourier transform and the amplitude spectrum of the throat sound component;
and obtaining the envelope of the gas-sound component according to the amplitude spectrum of the gas-sound component.
Further, the step of obtaining the unimodal test mask function and the reduction mask function specifically includes:
calculating a mask parameter η of the frequency domain response characteristic corresponding to the normal distribution window function with standard deviation σ, and further determining a mask template function L(x, ω, σ), where x is the frequency position of the amplitude spectrum in the frequency domain, ω is the angular velocity, and A is the amplitude;
obtaining a unimodal test mask function and a reduction mask function from the mask template function L(x, ω, σ), the unimodal test mask function V_k(j, f_i, σ) and the reduction mask function U_k(j, f_i, σ) being respectively:
where s is the sampling rate, f_i is the fundamental frequency of the ith window in the Fourier-transformed amplitude spectrum, c is the window length of the fast Fourier transform, j indicates that the mask template function L(c^-1 s j, 2kπ f_i, σ) corresponds to the jth frequency component of the Fourier-transformed amplitude spectrum, k denotes k times the fundamental frequency of the Fourier-transformed amplitude spectrum, and k is a positive integer.
Further, the step of performing window-overlapping fast Fourier transform on the input voice signal to obtain the Fourier-transformed amplitude spectrum specifically includes:
inputting a voice signal;
multiplying the ith window of the input voice signal by a normal distribution window function to obtain a signal after the multiplication of the window;
performing fast Fourier transform on the signal after window multiplication to obtain the amplitude spectrum after Fourier transform, which is the amplitude spectrum of the ith window of the input voice signal after fast Fourier transform and is indexed by j, its jth component in the frequency direction.
Further, the step of calculating the amplitude peak value of the amplitude spectrum after Fourier transform at the positive integer multiple fundamental frequency according to the unimodal test mask function and the determined fundamental frequency specifically comprises:
performing dot product operation on the single-peak test mask function and the amplitude spectrum after Fourier transform to obtain an amplitude peak value of the amplitude spectrum after Fourier transform at a positive integer multiple fundamental frequency, wherein a calculation formula of the amplitude peak value of the amplitude spectrum after Fourier transform at the positive integer multiple fundamental frequency is as follows:
where the result is the amplitude peak of the amplitude spectrum of the ith window at k times the fundamental frequency, and <vector 1, vector 2> denotes the dot product of vector 1 and vector 2.
Further, the step of obtaining an envelope of the larynx sound component according to the calculated amplitude peak value specifically includes:
and performing curve fitting on the calculated amplitude peaks to obtain the envelope of the throat sound component, wherein the curve fitting method comprises a piecewise polynomial fitting method.
Further, the step of calculating the amplitude spectrum of the larynx sound components according to the reduction mask function and the calculated amplitude peak value specifically comprises:
calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peak value, wherein the calculation formula of the amplitude spectrum of the throat sound component is as follows:
where the result, indexed by j (its jth component in the frequency direction), is the amplitude spectrum of the throat sound component that corresponds, after curve fitting, to the ith window of the input voice signal.
Further, the step of calculating the amplitude spectrum of the gas sound component according to the amplitude spectrum after Fourier transform and the amplitude spectrum of the throat sound component specifically comprises:
calculating the amplitude spectrum of the gas sound component according to the amplitude spectrum after Fourier transform and the amplitude spectrum of the throat sound component, wherein the calculation formula of the amplitude spectrum of the gas sound component is as follows:
where the result is the amplitude spectrum of the gas sound component corresponding to the ith window of the input voice signal, and | | denotes the absolute value.
Further, the step of obtaining the envelope of the gas-acoustic component according to the amplitude spectrum of the gas-acoustic component specifically includes:
performing Gaussian blur processing with standard deviation φ on the amplitude spectrum of the gas sound component to obtain the envelope of the gas sound component corresponding to the ith window of the input voice signal.
The other technical scheme adopted by the invention is as follows:
a system for identifying and separating throat sound and gas sound of a voice signal, comprising:
the mask function acquisition module is used for acquiring a unimodal test mask function and a reduction mask function;
the window-overlapping fast Fourier transform module is used for carrying out window-overlapping fast Fourier transform on the input voice signal to obtain an amplitude spectrum after Fourier transform;
the fundamental frequency measuring module is used for measuring the fundamental frequency of the amplitude spectrum of each window signal after Fourier transform;
the amplitude peak value calculation module is used for calculating the amplitude peak value of the amplitude spectrum after Fourier transform at the positive integral multiple fundamental frequency according to the unimodal test mask function and the determined fundamental frequency;
the larynx sound component envelope calculation module is used for obtaining the envelope of the larynx sound component according to the calculated amplitude peak value;
the throat sound component amplitude spectrum calculation module is used for calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peak value;
the gas sound component amplitude spectrum calculation module is used for calculating the amplitude spectrum of the gas sound component according to the amplitude spectrum after Fourier transform and the amplitude spectrum of the throat sound component;
and the gas-sound component envelope calculation module is used for obtaining the envelope of the gas-sound component according to the amplitude spectrum of the gas-sound component.
Further, the mask function obtaining module includes:
a mask template function determining unit for calculating a mask parameter η of the frequency domain response characteristic corresponding to a normal distribution window function with standard deviation σ, and further determining a mask template function L(x, ω, σ), where x is the frequency position of the amplitude spectrum in the frequency domain, ω is the angular velocity, and A is the amplitude;
a unimodal test mask function and reduction mask function obtaining unit for obtaining a unimodal test mask function and a reduction mask function from the mask template function L(x, ω, σ), the unimodal test mask function V_k(j, f_i, σ) and the reduction mask function U_k(j, f_i, σ) being respectively:
where s is the sampling rate, f_i is the fundamental frequency of the ith window in the Fourier-transformed amplitude spectrum, c is the window length of the fast Fourier transform, j indicates that the mask template function L(c^-1 s j, 2kπ f_i, σ) corresponds to the jth frequency component of the Fourier-transformed amplitude spectrum, k denotes k times the fundamental frequency of the Fourier-transformed amplitude spectrum, and k is a positive integer.
The method of the invention has the beneficial effects that: a unimodal test mask function and a reduction mask function are introduced; the amplitude peaks of the amplitude spectrum at positive integer multiples of the fundamental frequency are determined according to the unimodal test mask function and the fundamental frequency; the amplitude spectrum of the throat sound component is then calculated using the reduction mask function; and finally the amplitude spectrum of the gas sound component is obtained from the total amplitude spectrum and the amplitude spectrum of the throat sound component. The throat sound and the gas sound can thus be identified and separated from the voice signal, which facilitates processing operations on the voice signal such as compression, recognition, modification, noise reduction and optimization.
The system of the invention has the advantages that: a unimodal test mask function and a reduction mask function are introduced in the mask function acquisition module; the amplitude peaks of the amplitude spectrum at positive integer multiples of the fundamental frequency are determined in the amplitude peak value calculation module according to the unimodal test mask function and the fundamental frequency; the amplitude spectrum of the throat sound component is then calculated in the throat sound component amplitude spectrum calculation module using the reduction mask function; and finally the amplitude spectrum of the gas sound component is obtained in the gas sound component amplitude spectrum calculation module from the total amplitude spectrum and the amplitude spectrum of the throat sound component. The throat sound and the gas sound can thus be identified and separated from the voice signal, which facilitates subsequent processing operations on the voice signal such as compression, recognition, modification, noise reduction and optimization.
Drawings
FIG. 1 is a flow chart of a method for identifying and separating throat sound and gas sound of a voice signal according to the present invention;
FIG. 2 is a block diagram of a system for recognizing and separating throat sounds and gas sounds of a speech signal according to the present invention;
FIG. 3 is a flowchart of a first embodiment of the present invention;
FIG. 4 is a waveform diagram of a sinusoidal component D of a speech signal E according to an embodiment of the present invention;
FIG. 5 is a diagram of a standard normal distribution window function according to a first embodiment of the present invention;
FIG. 6 is a diagram illustrating signals after window multiplication according to a first embodiment of the present invention;
FIG. 7 is a diagram of the mask template function according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating the relationship between the throat sound component vector and the gas sound component vector of the amplitude spectrum corresponding to an input voice signal according to an embodiment of the present invention.
Detailed Description
Referring to fig. 1, a method for recognizing and separating throat sound and gas sound of a voice signal includes the following steps:
obtaining a unimodal test mask function and a reduction mask function;
carrying out window-overlapping fast Fourier transform on an input voice signal to obtain an amplitude spectrum after Fourier transform;
measuring the fundamental frequency of the amplitude spectrum of each window signal after Fourier transform;
calculating the amplitude peak value of the amplitude spectrum after Fourier transform at the positive integral multiple fundamental frequency according to the unimodal test mask function and the measured fundamental frequency;
obtaining the envelope of the larynx sound component according to the calculated amplitude peak value;
calculating an amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peak value;
calculating the amplitude spectrum of the gas sound component according to the amplitude spectrum after Fourier transform and the amplitude spectrum of the throat sound component;
and obtaining the envelope of the gas-sound component according to the amplitude spectrum of the gas-sound component.
The envelope of the throat sound component contains the amplitude information of the throat sound component, and the envelope of the gas sound component contains the amplitude information of the gas sound component, which facilitates subsequent operations on the voice signal such as compression and envelope detection.
Window-overlapping fast Fourier transform first splits the input voice signal into overlapping windows to obtain each window signal, and then obtains the corresponding amplitude spectrum by fast Fourier transform. Each window signal is thus one window of the input voice signal, and the Fourier-transformed amplitude spectrum of each window signal is the amplitude spectrum obtained from the input voice signal by window-overlapping fast Fourier transform.
Further as a preferred embodiment, the step of obtaining the unimodal test mask function and the reduction mask function specifically includes:
calculating the mask parameter η of the frequency domain response characteristic that corresponds to using a normal distribution window function with standard deviation σ, and further determining a mask template function L(x, ω, σ), where x is the frequency position of the amplitude spectrum in the frequency domain, ω is the angular velocity, and A is the amplitude;
obtaining a unimodal test mask function and a reduction mask function from the mask template function L(x, ω, σ), the unimodal test mask function V_k(j, f_i, σ) and the reduction mask function U_k(j, f_i, σ) being respectively:
where s is the sampling rate, f_i is the fundamental frequency of the ith window in the Fourier-transformed amplitude spectrum, c is the window length of the fast Fourier transform, j indicates that the mask template function L(c^-1 s j, 2kπ f_i, σ) corresponds to the jth frequency component of the Fourier-transformed amplitude spectrum, k denotes k times the fundamental frequency of the Fourier-transformed amplitude spectrum, and k is a positive integer.
The standard deviation σ of the normal distribution window function of the invention depends on γ, the lower limit of the fundamental frequency of the Fourier-transformed amplitude spectrum. In order to reduce the mutual influence between the peaks in the Fourier-transformed amplitude spectrum to a negligible degree, η should satisfy:
Further as a preferred embodiment, the step of performing window-overlapping fast Fourier transform on the input voice signal to obtain the Fourier-transformed amplitude spectrum specifically includes:
inputting a voice signal;
multiplying the ith window of the input voice signal by a normal distribution window function to obtain a signal after the multiplication of the window;
performing fast Fourier transform on the signal after window multiplication to obtain the amplitude spectrum after Fourier transform, which is the amplitude spectrum of the ith window of the input voice signal after fast Fourier transform and is indexed by j, its jth component in the frequency direction.
The normal distribution window function may be a standard normal distribution window function, where t is time.
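The window multiplication and transform described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the hop size `hop`, the choice to express `sigma` in samples, and the unnormalised window height are all hypothetical, and the radix-2 FFT requires the window length c to be a power of two.

```python
import cmath
import math


def gaussian_window(c, sigma):
    """Normal-distribution window of length c; sigma is in samples (assumed)."""
    mid = (c - 1) / 2.0
    return [math.exp(-0.5 * ((n - mid) / sigma) ** 2) for n in range(c)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def amplitude_spectra(signal, c, hop, sigma):
    """Amplitude spectrum of every window; hop < c makes the windows overlap."""
    w = gaussian_window(c, sigma)
    spectra = []
    for start in range(0, len(signal) - c + 1, hop):
        frame = [signal[start + n] * w[n] for n in range(c)]
        # Keep the non-negative frequency bins 0 .. c/2 only.
        spectra.append([abs(v) for v in fft(frame)[: c // 2 + 1]])
    return spectra
```

Here `spectra[i][j]` plays the role of the jth frequency component of the ith window; bin j corresponds to the frequency j·s/c for sampling rate s.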
Further as a preferred embodiment, the step of calculating the amplitude peak value of the Fourier-transformed amplitude spectrum at the positive integer multiple fundamental frequency according to the unimodal test mask function and the determined fundamental frequency is specifically:
performing dot product operation on the single-peak test mask function and the amplitude spectrum after Fourier transform to obtain an amplitude peak value of the amplitude spectrum after Fourier transform at a positive integer multiple fundamental frequency, wherein a calculation formula of the amplitude peak value of the amplitude spectrum after Fourier transform at the positive integer multiple fundamental frequency is as follows:
where the result is the amplitude peak of the amplitude spectrum of the ith window at k times the fundamental frequency, and <vector 1, vector 2> denotes the dot product of vector 1 and vector 2.
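The dot-product step can be illustrated with a small sketch. The exact form of the unimodal test mask function is given by a formula that does not survive in this text, so the code below substitutes a hypothetical Gaussian-shaped lobe and normalises it so that its dot product with a lobe of height a returns a. The names `lobe`, `harmonic_peaks` and `sigma_hz` are assumptions made for illustration only.

```python
import math


def lobe(c, s, center_hz, sigma_hz):
    """Hypothetical single-peak lobe sampled at the FFT bin frequencies j*s/c."""
    return [math.exp(-0.5 * ((j * s / c - center_hz) / sigma_hz) ** 2)
            for j in range(c // 2 + 1)]


def dot(u, v):
    """Dot product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def harmonic_peaks(spectrum, c, s, f0, sigma_hz, n_harmonics):
    """Estimate the lobe height at k*f0 for k = 1..n_harmonics by a dot product."""
    peaks = []
    for k in range(1, n_harmonics + 1):
        g = lobe(c, s, k * f0, sigma_hz)
        scale = dot(g, g)
        test_mask = [x / scale for x in g]  # normalised so <test_mask, g> == 1
        peaks.append(dot(test_mask, spectrum))
    return peaks
```

The estimate is only clean when neighbouring lobes barely overlap, which is the reason the description asks that the mutual influence between peaks be negligible.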
Further as a preferred embodiment, the step of obtaining the envelope of the larynx sound component according to the calculated amplitude peak value specifically includes:
and performing curve fitting on the calculated amplitude peaks to obtain the envelope of the throat sound component, wherein the curve fitting method comprises a piecewise polynomial fitting method.
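As one simple instance of piecewise polynomial fitting, the sketch below joins the harmonic peaks with degree-1 polynomials (straight segments). The source does not state the polynomial degree or the boundary handling, so both the linear segments and the clamping outside the first and last peak are assumptions; the function name is hypothetical.

```python
from bisect import bisect_right


def piecewise_linear_envelope(peak_freqs, peak_amps, query_freqs):
    """Evaluate a degree-1 piecewise polynomial through (peak_freqs, peak_amps).

    peak_freqs must be strictly increasing; queries outside the range are
    clamped to the nearest end value.
    """
    out = []
    for f in query_freqs:
        if f <= peak_freqs[0]:
            out.append(peak_amps[0])
        elif f >= peak_freqs[-1]:
            out.append(peak_amps[-1])
        else:
            i = bisect_right(peak_freqs, f) - 1
            t = (f - peak_freqs[i]) / (peak_freqs[i + 1] - peak_freqs[i])
            out.append(peak_amps[i] + t * (peak_amps[i + 1] - peak_amps[i]))
    return out
```

A higher-degree piecewise fit (for example cubic segments) would give a smoother envelope at the cost of possible overshoot between peaks.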
Further as a preferred embodiment, the step of calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peak specifically includes:
calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peak value, wherein the calculation formula of the amplitude spectrum of the throat sound component is as follows:
where the result, indexed by j (its jth component in the frequency direction), is the amplitude spectrum of the throat sound component that corresponds, after curve fitting, to the ith window of the input voice signal.
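The reconstruction of the throat sound amplitude spectrum can be pictured as placing one scaled lobe at each harmonic. The formula for the reduction mask function is not reproduced in this text, so the sketch below assumes a hypothetical unit-height Gaussian lobe and a plain sum over harmonics; `unit_lobe`, `bin_hz` and `sigma_hz` are illustrative names, not terms from the source.

```python
import math


def unit_lobe(n_bins, bin_hz, center_hz, sigma_hz):
    """Hypothetical reduction mask: a lobe of height 1 centred at center_hz."""
    return [math.exp(-0.5 * ((j * bin_hz - center_hz) / sigma_hz) ** 2)
            for j in range(n_bins)]


def throat_spectrum(peaks, f0, n_bins, bin_hz, sigma_hz):
    """Sum one scaled lobe per harmonic; peaks[k-1] is the amplitude at k*f0."""
    total = [0.0] * n_bins
    for k, amp in enumerate(peaks, start=1):
        mask = unit_lobe(n_bins, bin_hz, k * f0, sigma_hz)
        total = [t + amp * m for t, m in zip(total, mask)]
    return total
```

For example, with peaks of 3.0 and 1.5 at 200 Hz and 400 Hz, the reconstructed spectrum is close to 3.0 and 1.5 at those bins and close to zero midway between them.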
Further as a preferred embodiment, the step of calculating the amplitude spectrum of the gas sound component according to the amplitude spectrum after Fourier transform and the amplitude spectrum of the throat sound component specifically includes:
calculating the amplitude spectrum of the gas sound component according to the amplitude spectrum after Fourier transform and the amplitude spectrum of the throat sound component, wherein the calculation formula of the amplitude spectrum of the gas sound component is as follows:
where the result is the amplitude spectrum of the gas sound component corresponding to the ith window of the input voice signal, and | | denotes the absolute value.
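The formula itself is not reproduced in this text; the only surviving detail is that it involves an absolute value. The sketch below therefore assumes the simplest reading, an element-wise absolute difference between the total amplitude spectrum and the throat sound amplitude spectrum. This is an assumption for illustration and may differ from the patented formula.

```python
def gas_spectrum(total_spectrum, throat_spectrum):
    """Element-wise |total - throat|; both lists must have the same length."""
    if len(total_spectrum) != len(throat_spectrum):
        raise ValueError("spectra must have the same number of frequency bins")
    return [abs(t - h) for t, h in zip(total_spectrum, throat_spectrum)]
```

Taking the absolute value keeps the result non-negative even where the fitted throat sound spectrum slightly exceeds the measured spectrum.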
Further as a preferred embodiment, the step of obtaining the envelope of the gas-acoustic component according to the amplitude spectrum of the gas-acoustic component specifically includes:
performing Gaussian blur processing with standard deviation φ on the amplitude spectrum of the gas sound component to obtain the envelope of the gas sound component corresponding to the ith window of the input voice signal.
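Gaussian blur along the frequency axis can be sketched as a convolution with a normalised Gaussian kernel. The kernel radius of three standard deviations, the edge handling, and the choice to measure `phi` in frequency bins are assumptions; the source only states that the blur has standard deviation φ.

```python
import math


def gaussian_blur(values, phi):
    """Smooth a sequence with a Gaussian kernel of standard deviation phi (bins).

    Edges are handled by renormalising over the bins that exist, so a constant
    input stays constant.
    """
    radius = max(1, int(math.ceil(3 * phi)))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (d / phi) ** 2) for d in offsets]
    out = []
    for j in range(len(values)):
        acc = weight = 0.0
        for d, w in zip(offsets, kernel):
            if 0 <= j + d < len(values):
                acc += w * values[j + d]
                weight += w
        out.append(acc / weight)
    return out
```

A larger phi gives a smoother envelope that follows the broad shape of the gas sound spectrum rather than its bin-to-bin fluctuations.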
Referring to FIG. 2, a system for identifying and separating throat sound and gas sound of a voice signal includes:
the mask function acquisition module is used for acquiring a unimodal test mask function and a reduction mask function;
the window-overlapping fast Fourier transform module is used for carrying out window-overlapping fast Fourier transform on the input voice signal to obtain an amplitude spectrum after Fourier transform;
the fundamental frequency measuring module is used for measuring the fundamental frequency of the amplitude spectrum of each window signal after Fourier transform;
the amplitude peak value calculation module is used for calculating the amplitude peak value of the amplitude spectrum after Fourier transform at the positive integral multiple fundamental frequency according to the unimodal test mask function and the determined fundamental frequency;
the larynx sound component envelope calculation module is used for obtaining the envelope of the larynx sound component according to the calculated amplitude peak value;
the throat sound component amplitude spectrum calculation module is used for calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peak value;
the gas sound component amplitude spectrum calculation module is used for calculating the amplitude spectrum of the gas sound component according to the amplitude spectrum after Fourier transform and the amplitude spectrum of the throat sound component;
and the gas-sound component envelope calculation module is used for obtaining the envelope of the gas-sound component according to the amplitude spectrum of the gas-sound component.
Referring to fig. 2, further as a preferred embodiment, the mask function obtaining module includes:
a mask template function determining unit for calculating a mask parameter η of the frequency domain response characteristic corresponding to the normal distribution window function with standard deviation σ, and determining a mask template function L(x, ω, σ), where x is the frequency position of the amplitude spectrum in the frequency domain, ω is the angular velocity, and A is the amplitude;
a unimodal test mask function and reduction mask function obtaining unit for obtaining a unimodal test mask function and a reduction mask function from the mask template function L(x, ω, σ), the unimodal test mask function V_k(j, f_i, σ) and the reduction mask function U_k(j, f_i, σ) being respectively:
where s is the sampling rate, f_i is the fundamental frequency of the ith window in the Fourier-transformed amplitude spectrum, c is the window length of the fast Fourier transform, j indicates that the mask template function L(c^-1 s j, 2kπ f_i, σ) corresponds to the jth frequency component of the Fourier-transformed amplitude spectrum, k denotes k times the fundamental frequency of the Fourier-transformed amplitude spectrum, and k is a positive integer.
The invention is further explained and illustrated in the following description with reference to the figures and the specific embodiments thereof.
Example one
Aiming at the problem that the prior art cannot separate throat sound and gas sound from a voice signal, the invention provides a novel method for identifying and separating the throat sound and gas sound of a voice signal. In this method, the input voice signal is multiplied by a normal distribution window function with standard deviation σ and then Fourier transformed to obtain the Fourier-transformed amplitude spectrum; a series of processing operations using the unimodal test mask function and the reduction mask function then yields the envelope of the throat sound component and the envelope of the gas sound component, which facilitates subsequent processing of the voice such as compression, recognition, modification, noise reduction and optimization.
The method of the present invention requires window-overlapping processing of the input voice signal, each window having the same length c, where s is the sampling rate and γ is the lower limit of the fundamental frequency handled by the method. The steps described below are all applied to a given (e.g., the ith) window of the input voice signal. Referring to FIG. 3, the method for identifying and separating throat sound and gas sound of a voice signal of the invention comprises the following steps:
S1. Calculate the mask parameter η of the frequency-domain response characteristic that corresponds to a normal distribution window with standard deviation σ, determine the mask template function L(x, ω, σ) from it, and derive the unimodal test mask function $V_k(j, f_i, \sigma)$ and the reduction mask function $U_k(j, f_i, \sigma)$ according to the mask parameter η;
S2. Input a voice signal and apply a windowed FFT (fast Fourier transform) to obtain its amplitude spectrum, where i denotes the window number and j denotes the j-th component of the window signal in the frequency direction;
S3. Measure the fundamental frequency of each window signal, denoted $f_i$;
S4. Take the dot product of $V_k(j, f_i, \sigma)$ and the amplitude spectrum to obtain the amplitude peak of the amplitude spectrum at k times the fundamental frequency;
S5. Perform curve fitting on the amplitude peaks to obtain the envelope of the throat sound component;
S6. Combine $U_k(j, f_i, \sigma)$, whose peak lies at k times the fundamental frequency, with the amplitude peaks to calculate the amplitude spectrum of the throat sound component of the corresponding window, where k is a positive integer;
S7. Calculate the amplitude spectrum of the gas sound component;
S8. Calculate the envelope of the gas sound component.
With reference to figs. 4, 5, 6 and 7, step S1 can be further refined into the following steps:
S11. η = η(σ) is expressed as […]. Denote the peak interval as […]. To reduce the mutual influence between adjacent peaks to a negligible level, the method should satisfy […], and therefore […]. From the expression […] it follows that […]. By the 3σ rule of the normal distribution function, c ≥ 6σ, which gives […]. In summary, […].
To satisfy […], it follows that […]; and because […], it follows that […]. In summary, […].
S12. Assume that a voice signal E exists and let D be a sinusoidal component of E. The waveform of D is shown in fig. 4; D is characterized by its angular velocity ω, its phase angle, and its amplitude A;
S13. Select a normal distribution function with standard deviation σ as the window function of the FFT, denoted W(t, σ), as shown in fig. 5;
S14. Multiply D from step S12 by W(t, σ) from step S13 to obtain the windowed signal shown in fig. 6, which can be expressed as:
Equation (1) is denoted […]. Performing a fast Fourier transform on equation (1) yields the frequency-domain amplitude shown in fig. 7, which is expressed as:
From equation (2), once σ is fixed, the frequency-domain amplitude only shifts along the x-axis as ω changes and scales vertically as A changes; its basic shape remains unchanged. Therefore, L(x, ω, σ) can be used as a mask template function that represents the frequency-domain behavior of every time-domain component with amplitude A and angular velocity ω.
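The shift-and-scale behavior described for the mask template can be checked numerically. The sketch below is an illustration under stated assumptions rather than the patent's own formula: it assumes W(t, σ) is the normalized Gaussian density, that σ is measured in seconds, and that the Fourier integral can be approximated by a Riemann sum. Under those assumptions the positive-frequency magnitude works out to (A/2)·exp(−σ²(2πx − ω)²/2). The function names are hypothetical.

```python
import numpy as np

def windowed_sine_magnitude(A, omega, phase, sigma, x, t):
    """Magnitude of the Fourier integral of A*sin(omega*t + phase) * W(t, sigma)."""
    # Assumption: W(t, sigma) is the normalized Gaussian density.
    window = np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    signal = A * np.sin(omega * t + phase) * window
    dt = t[1] - t[0]
    # Riemann-sum approximation of the continuous Fourier integral at each x.
    kernel = np.exp(-2j * np.pi * np.outer(x, t))
    return np.abs(kernel @ signal) * dt

def assumed_template(A, omega, sigma, x):
    """Closed form implied by the Gaussian-window assumption (positive x only)."""
    return 0.5 * A * np.exp(-0.5 * sigma**2 * (2 * np.pi * x - omega) ** 2)

sigma = 0.01                              # seconds (assumed unit)
t = np.linspace(-0.08, 0.08, 8001)        # covers +/- 8 sigma
x = np.linspace(50.0, 600.0, 111)         # frequency grid in Hz, 5 Hz steps

spec_200 = windowed_sine_magnitude(1.0, 2 * np.pi * 200.0, 0.3, sigma, x, t)
spec_300 = windowed_sine_magnitude(2.0, 2 * np.pi * 300.0, 1.1, sigma, x, t)
```

Changing ω from 2π·200 to 2π·300 moves the bump by 100 Hz (20 grid steps), and doubling A doubles its height, while the shape is unchanged.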
S15. The unimodal test mask function $V_k(j, f_i, \sigma)$ can be derived from L(x, ω, σ). The derivation is as follows:
In its calculations, the invention assumes that the throat sound signal and the gas sound signal are orthogonal (that is, the two signals combine additively). Under this assumption, as shown in fig. 8, the amplitude spectrum vector of the input voice signal can be orthogonally decomposed into a throat sound component vector and a gas sound component vector. The throat sound component vector should have the same direction as the vector L(x, ω, σ), so it can be estimated by computing the projection of the amplitude spectrum vector onto L(x, ω, σ). Because the expression for L(x, ω, σ) contains the undetermined amplitude A, L(x, ω, σ) is first normalized to unit length, denoted $L_1(x, \omega, \sigma)$, and the projection of the amplitude spectrum vector onto $L_1(x, \omega, \sigma)$ is then computed. As shown in fig. 8, let θ be the angle between the amplitude spectrum vector and $L_1(x, \omega, \sigma)$, and let the projection of the amplitude spectrum vector onto $L_1(x, \omega, \sigma)$ be $t_1 \cdot L_1(x, \omega, \sigma)$. Here ⟨vector 1, vector 2⟩ denotes the dot product of vector 1 and vector 2. Because $\|L_1(x, \omega, \sigma)\| = 1$, $t_1$ equals the dot product of the amplitude spectrum vector with $L_1(x, \omega, \sigma)$, and the peak of the projection is $t_1$ times the peak of $L_1(x, \omega, \sigma)$. In principle, therefore, the amplitude peak of the throat sound component at k times the fundamental frequency can be estimated by computing the component $t_1 \cdot L_1(x, \omega, \sigma)$ of the amplitude spectrum vector along $L_1(x, \omega, \sigma)$, that is, as $t_1 \cdot \max(L_1(x, \omega, \sigma))$, where $\max(L_1(x, \omega, \sigma))$ denotes the maximum value of $L_1(x, \omega, \sigma)$ for the given ω and σ;
Computing the peak after the dot product requires considerable computation. To reduce it, a unimodal test mask function $V_k(j, f_i, \sigma)$ can be constructed so that the amplitude peak is obtained directly from the dot product of the amplitude spectrum with $V_k(j, f_i, \sigma)$.
In summary, $V_k(j, f_i, \sigma)$ is derived as follows:
S151. Normalize L(x, ω, σ) to unit length and denote the result $L_1(x, \omega, \sigma)$, which is expressed as:
S152. $L_1(x, \omega, \sigma)$ attains its maximum value, $\max(L_1(x, \omega, \sigma))$, at x = ω/2π;
S153. Calculate the projection coefficient $t_1$;
S154. Compute the peak value from the projection, as $t_1$ times $\max(L_1(x, \omega, \sigma))$;
S155. Substituting the result of step S153 into step S154, the peak value can be expressed as the dot product of the amplitude spectrum with $L_1(x, \omega, \sigma) \cdot \max(L_1(x, \omega, \sigma))$;
S156. Extract the factor $L_1(x, \omega, \sigma) \cdot \max(L_1(x, \omega, \sigma))$ from the above formula and, using the relation between frequency and position and the relation between angular velocity and integer multiples of the fundamental frequency, substitute $x = c^{-1}sj$ and $\omega = 2k\pi f_i$ to obtain the unimodal test mask function $V_k(j, f_i, \sigma) = L_1(c^{-1}sj, 2k\pi f_i, \sigma) \cdot \max(L_1(c^{-1}sj, 2k\pi f_i, \sigma))$;
S157. In summary, $V_k(j, f_i, \sigma)$ simplifies to:
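As a rough illustration of steps S151 to S156, the following sketch builds $V_k$ as $L_1 \cdot \max(L_1)$ and uses a dot product to recover a peak height. It is not the patent's simplified closed form: the Gaussian shape of the template, σ in seconds, and the names `assumed_template` and `unimodal_test_mask` are assumptions made for the example.

```python
import numpy as np

def assumed_template(freq_hz, center_hz, sigma):
    """Assumed Gaussian bump for L(x, omega, sigma); the amplitude A is dropped
    because normalization in S151 removes it anyway."""
    return np.exp(-2 * np.pi**2 * sigma**2 * (freq_hz - center_hz) ** 2)

def unimodal_test_mask(j, f_i, k, sigma, s, c):
    """V_k(j, f_i, sigma) = L1 * max(L1), with x = s*j/c and omega = 2*k*pi*f_i."""
    L = assumed_template(s * j / c, k * f_i, sigma)
    L1 = L / np.linalg.norm(L)      # S151: normalize to unit length
    return L1 * L1.max()            # S156: fold max(L1) into the mask

s, c, sigma = 16000, 1024, 0.008    # sampling rate, window length, sigma in seconds
j = np.arange(c // 2 + 1)           # non-negative frequency bins
f_i, k = 125.0, 2                   # fundamental (Hz) and harmonic number

V = unimodal_test_mask(j, f_i, k, sigma, s, c)

# Synthetic amplitude spectrum: a single bump of height 0.7 at k * f_i.
spectrum = 0.7 * assumed_template(s * j / c, k * f_i, sigma)
peak = float(np.dot(spectrum, V))   # one dot product gives the peak height
```

Folding $\max(L_1)$ into the mask means the peak is obtained with a single dot product per harmonic, which is the saving the text describes.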
S16. The reduction mask function $U_k(j, f_i, \sigma)$ can also be derived from L(x, ω, σ). The derivation is as follows: to calculate the amplitude spectrum of the throat sound component of a window, the amplitude functions at the positive integer multiples of the fundamental frequency must be superposed. Each such amplitude function is the product of the amplitude peak at the corresponding multiple of the fundamental frequency and the reduction mask function $U_k(j, f_i, \sigma)$. The amplitude peak of $U_k(j, f_i, \sigma)$ must therefore be 1. The method can then combine, by multiplication and summation, the $U_k(j, f_i, \sigma)$ whose peak lies at k times the fundamental frequency with the amplitude peaks to obtain the amplitude spectrum of the throat sound component of the corresponding window.
In summary, $U_k(j, f_i, \sigma)$ is derived as follows:
From step S14, L(x, ω, σ) is expressed as:
L(x, ω, σ) attains its maximum value at x = ω/2π. Therefore, to give the reduction mask function $U_k(j, f_i, \sigma)$ an amplitude peak of 1, it suffices to multiply the mask template function L(x, ω, σ) by the reciprocal of that maximum value.
Using the relation between frequency and position and the relation between angular velocity and integer multiples of the fundamental frequency, substitute $x = c^{-1}sj$ and $\omega = 2k\pi f_i$; $U_k(j, f_i, \sigma)$ can then be expressed as:
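A minimal sketch of a peak-normalized mask follows. It assumes, as an illustration only, that the mask template is the Gaussian bump obtained for a normalized Gaussian window with σ in seconds; dividing by its maximum leaves exp(−σ²(2πx − ω)²/2) with $x = c^{-1}sj$ and $\omega = 2k\pi f_i$. The function name `reduction_mask` is hypothetical.

```python
import numpy as np

def reduction_mask(j, f_i, k, sigma, s, c):
    """Assumed form of U_k(j, f_i, sigma): template rescaled so its peak is 1."""
    x = s * np.asarray(j, dtype=float) / c     # bin index -> frequency in Hz
    omega = 2 * k * np.pi * f_i                # angular velocity of harmonic k
    return np.exp(-0.5 * sigma**2 * (2 * np.pi * x - omega) ** 2)

s, c, sigma = 16000, 1024, 0.008
j = np.arange(c // 2 + 1)
U3 = reduction_mask(j, 125.0, 3, sigma, s, c)  # third harmonic of 125 Hz
```

With these example values the third harmonic (375 Hz) falls exactly on bin 24, so the mask reaches 1 there.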
Step S2 specifically includes:
S21. Multiply the i-th window of the input voice signal by the normal distribution window function to obtain the windowed signal;
The window function used by the FFT is small at both ends and large in the middle, and the ends of a typical window function are close to 0. The FFT window can therefore be approximated as an infinitely long window whose two sides approach 0. The normal distribution window function selected by the method is simple enough to make the subsequent integration straightforward.
S22. Perform a fast Fourier transform (FFT) on the windowed signal to obtain the amplitude spectrum of the i-th window of the input voice signal; its components are indexed by j in the frequency direction.
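Steps S21 and S22 can be sketched as below. This is a minimal example, assuming a normalized Gaussian window centered on the frame with σ expressed in samples; the function name and the choice σ = c/8 (which satisfies c ≥ 6σ) are illustrative and not taken from the text.

```python
import numpy as np

def windowed_amplitude_spectrum(frame, sigma_samples):
    """S21 + S22: Gaussian-window one frame and return |FFT| for bins 0..c/2."""
    frame = np.asarray(frame, dtype=float)
    c = frame.size
    n = np.arange(c) - (c - 1) / 2.0          # sample offsets from frame center
    window = np.exp(-n**2 / (2 * sigma_samples**2)) / (sigma_samples * np.sqrt(2 * np.pi))
    return np.abs(np.fft.rfft(frame * window))

s, c = 16000, 1024
n = np.arange(c)
frame = 1.0 * np.sin(2 * np.pi * 250.0 * n / s)   # unit-amplitude 250 Hz tone
spectrum = windowed_amplitude_spectrum(frame, sigma_samples=c / 8)
```

Because the assumed window sums to roughly 1, a sinusoid of amplitude A shows up as a bump of height close to A/2 at its frequency bin (bin 16 here).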
Because the window length can be treated as approximately infinite, the amplitude spectrum after the fast Fourier transform is calculated with an infinitely long window. The amplitude at frequency position x in the spectrum is denoted f(x), which is calculated as:
To simplify the calculation, the discrete signal in the above equation is approximated as a continuous signal, which converts f(x) into an integral form, g(x):
For x > 0, solving the integral in g(x) and simplifying gives:
Splitting g(x) in the above formula into its two terms shows that both are low at the sides and high in the middle. One term peaks at x = −ω/2π and the other peaks at x = ω/2π. Therefore, by the 3σ rule, if […], the term that peaks at x = −ω/2π is small for x ≥ 0 and can be neglected. In that case, g(x) becomes:
From equation (4), once σ is fixed, this function only shifts along the x-axis as ω changes and scales vertically as A changes; its basic shape remains unchanged. Therefore, L(x, ω, σ) in equation (4) can be used as a mask template function that represents the frequency-domain behavior of every time-domain component with amplitude A and angular velocity ω.
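A worked bound shows why one of the two terms can be dropped. The following is a sketch under assumptions not confirmed by the text: W(t, σ) is taken to be the normalized Gaussian density, the sinusoid is written as A sin(ωt + φ₀) with a hypothetical phase symbol φ₀, and the continuous Fourier integral is used. Under those assumptions the two terms are Gaussian bumps centered at ±ω/2π, each with standard deviation 1/(2πσ) in x:

```latex
g(x) = \frac{A}{2}\left|\, e^{i\varphi_0}\, e^{-\sigma^2 (2\pi x - \omega)^2 / 2}
       \;-\; e^{-i\varphi_0}\, e^{-\sigma^2 (2\pi x + \omega)^2 / 2} \right|

% For x \ge 0 and \omega > 0 we have (2\pi x + \omega)^2 \ge \omega^2, hence
e^{-\sigma^2 (2\pi x + \omega)^2 / 2} \;\le\; e^{-\sigma^2 \omega^2 / 2}
  \;\le\; e^{-9/2} \approx 0.011 \qquad \text{whenever } \sigma\omega \ge 3 .
```

In words, once the bump centered at −ω/2π lies at least three of its own standard deviations away from x = 0, it contributes about 1% or less of the peak height anywhere on the non-negative frequency axis.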
In step S3, let γ be the lower limit and λ the upper limit of the fundamental frequency values used by the method; the fundamental frequency measured by the method then satisfies $f_i \in [\gamma, \lambda]$.
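The text does not say how the fundamental frequency is measured. Purely as an illustration, the sketch below swaps in a plain autocorrelation estimator and restricts the lag search so that the result lies in [γ, λ]; the estimator, the function name, and the example values of γ and λ are all assumptions.

```python
import numpy as np

def estimate_f0_autocorr(frame, s, gamma, lam):
    """Autocorrelation pitch estimate constrained to gamma <= f0 <= lam (Hz)."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Autocorrelation at lags 0 .. len(frame) - 1.
    ac = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    lag_min = int(np.ceil(s / lam))                       # shortest period allowed
    lag_max = min(int(np.floor(s / gamma)), ac.size - 1)  # longest period allowed
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return s / best_lag

s = 16000
n = np.arange(2000)
frame = np.sin(2 * np.pi * 200.0 * n / s)                 # 200 Hz test tone
f_i = estimate_f0_autocorr(frame, s, gamma=80.0, lam=500.0)
```

Restricting the lag range to [s/λ, s/γ] is what enforces the interval; any other pitch tracker with the same constraint would serve equally well here.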
Step S4 specifically includes:
Take the dot product of $V_k(j, f_i, \sigma)$ and the amplitude spectrum to obtain the amplitude peak of the amplitude spectrum at k times the fundamental frequency. The amplitude peak is calculated as:
According to step S15:
Combining the two formulas gives:
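As an illustration of step S4 applied to several harmonics at once, the sketch below stacks one mask per harmonic into a matrix and obtains all peaks with a single matrix-vector product. The Gaussian template, σ in seconds, and the name `harmonic_peaks` are assumptions for the example, not the patent's combined formula.

```python
import numpy as np

def harmonic_peaks(spectrum, f_i, sigma, s, c, num_harmonics):
    """Dot each V_k with the amplitude spectrum, for k = 1..num_harmonics."""
    x = s * np.arange(spectrum.size) / c                   # bin frequencies (Hz)
    k = np.arange(1, num_harmonics + 1)[:, None]           # column of harmonic numbers
    L = np.exp(-2 * np.pi**2 * sigma**2 * (x[None, :] - k * f_i) ** 2)
    L1 = L / np.linalg.norm(L, axis=1, keepdims=True)      # unit-length templates
    V = L1 * L1.max(axis=1, keepdims=True)                 # one mask per row
    return V @ spectrum

s, c, sigma, f_i = 16000, 1024, 0.008, 125.0
x = s * np.arange(c // 2 + 1) / c
true_peaks = np.array([1.0, 0.6, 0.3, 0.1])
# Synthetic harmonic spectrum: one bump per harmonic with the heights above.
spectrum = sum(a * np.exp(-2 * np.pi**2 * sigma**2 * (x - (m + 1) * f_i) ** 2)
               for m, a in enumerate(true_peaks))
peaks = harmonic_peaks(spectrum, f_i, sigma, s, c, num_harmonics=4)
```

The recovered values match the synthetic heights closely because neighboring bumps barely overlap at this σ; with a narrower time window the overlap, and hence the error, would grow.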
The curve fitting in step S5 may be a piecewise polynomial fit.
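The simplest piecewise polynomial is a piecewise-linear one. The sketch below uses it to turn harmonic peaks into an envelope; the choice of degree 1 and the function name are assumptions, and a higher-order piecewise fit could be substituted.

```python
import numpy as np

def throat_envelope(peaks, f_i, freqs_hz):
    """Piecewise-linear envelope through the points (k * f_i, peak_k)."""
    harmonic_freqs = f_i * np.arange(1, len(peaks) + 1)
    # np.interp holds the first/last peak value outside the harmonic range.
    return np.interp(freqs_hz, harmonic_freqs, peaks)

peaks = np.array([1.0, 0.6, 0.3, 0.1])
freqs = np.array([125.0, 187.5, 250.0, 500.0, 600.0])
envelope = throat_envelope(peaks, 125.0, freqs)
```

At 187.5 Hz, halfway between the first two harmonics, the envelope is the average of the two neighboring peaks.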
Step S6 specifically includes:
Combine $U_k(j, f_i, \sigma)$, whose peak lies at k times the fundamental frequency, with the amplitude peaks to calculate the amplitude spectrum of the throat sound component of the corresponding window. The calculation formula is:
where k is a positive integer;
According to step S16:
Combining the two formulas gives:
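Step S6 amounts to a weighted sum of peak-normalized masks. The following sketch assumes the same Gaussian form for $U_k$ as elsewhere in these examples, with σ in seconds; the function name is hypothetical.

```python
import numpy as np

def throat_amplitude_spectrum(peaks, f_i, sigma, s, c):
    """Sum over k of peak_k * U_k(j, f_i, sigma) for bins j = 0..c/2."""
    x = s * np.arange(c // 2 + 1) / c
    k = np.arange(1, len(peaks) + 1)[:, None]
    # Assumed U_k: Gaussian bump with peak value 1 at k * f_i.
    U = np.exp(-0.5 * sigma**2 * (2 * np.pi * x[None, :] - 2 * k * np.pi * f_i) ** 2)
    return np.asarray(peaks, dtype=float) @ U

s, c, sigma, f_i = 16000, 1024, 0.008, 125.0
peaks = np.array([1.0, 0.6, 0.3, 0.1])
throat = throat_amplitude_spectrum(peaks, f_i, sigma, s, c)
```

Because every $U_k$ peaks at 1, the reconstructed spectrum passes through the measured peak heights at the harmonic bins (8, 16, 24 and 32 in this example).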
Step S8 specifically includes:
Convolve the amplitude spectrum of the gas sound component with a normal distribution function with standard deviation φ to obtain the envelope of the gas sound component.
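Steps S7 and S8 can be sketched together. The example assumes the gas sound amplitude spectrum is the absolute difference between the full amplitude spectrum and the throat sound amplitude spectrum (the claims mention an absolute value, but the formula itself is not reproduced), and that φ is measured in frequency bins; both function names are hypothetical.

```python
import numpy as np

def gas_amplitude_spectrum(spectrum, throat):
    """S7 (assumed form): |full amplitude spectrum - throat amplitude spectrum|."""
    return np.abs(np.asarray(spectrum, dtype=float) - np.asarray(throat, dtype=float))

def gas_envelope(gas, phi_bins):
    """S8: convolve with a sampled normal distribution of standard deviation phi."""
    radius = int(np.ceil(4 * phi_bins))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-offsets**2 / (2 * phi_bins**2))
    kernel /= kernel.sum()                    # unit gain so flat regions stay flat
    return np.convolve(gas, kernel, mode="same")

spectrum = np.full(200, 0.5)                  # flat synthetic amplitude spectrum
throat = np.zeros(200)
throat[100] = 0.2                             # one narrow throat sound peak
gas = gas_amplitude_spectrum(spectrum, throat)
envelope = gas_envelope(gas, phi_bins=3.0)
```

The blur smooths the narrow dip left at bin 100 while leaving flat regions untouched; values near the two ends are attenuated because `mode="same"` pads with zeros.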
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for identifying and separating the throat sound and the gas sound of a voice signal, characterized in that the method comprises the following steps:
obtaining a unimodal test mask function and a reduction mask function;
performing a windowed fast Fourier transform on an input voice signal to obtain a Fourier-transformed amplitude spectrum;
measuring the fundamental frequency of the Fourier-transformed amplitude spectrum of each window signal;
calculating the amplitude peak of the Fourier-transformed amplitude spectrum at each positive integer multiple of the fundamental frequency according to the unimodal test mask function and the measured fundamental frequency;
obtaining the envelope of the throat sound component according to the calculated amplitude peaks;
calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peaks;
calculating the amplitude spectrum of the gas sound component according to the Fourier-transformed amplitude spectrum and the amplitude spectrum of the throat sound component;
and obtaining the envelope of the gas sound component according to the amplitude spectrum of the gas sound component.
2. The method for identifying and separating the throat sound and the gas sound of a voice signal according to claim 1, characterized in that the step of obtaining the unimodal test mask function and the reduction mask function specifically comprises:
calculating the mask parameter η of the frequency-domain response characteristic corresponding to a normal distribution window function with standard deviation σ, and thereby determining the mask template function L(x, ω, σ), where x is the frequency position of the amplitude spectrum in the frequency domain, ω is the angular velocity, and A is the amplitude;
obtaining the unimodal test mask function and the reduction mask function from the mask template function L(x, ω, σ), the unimodal test mask function $V_k(j, f_i, \sigma)$ and the reduction mask function $U_k(j, f_i, \sigma)$ being respectively:
where s is the sampling rate, $f_i$ is the fundamental frequency of the i-th window of the Fourier-transformed amplitude spectrum, c is the window length of the fast Fourier transform, j indicates that the mask template function $L(c^{-1}sj, 2k\pi f_i, \sigma)$ corresponds to the j-th frequency component of the Fourier-transformed amplitude spectrum, k denotes the k-th multiple of the fundamental frequency of the Fourier-transformed amplitude spectrum, and k is a positive integer.
3. The method for identifying and separating the throat sound and the gas sound of a voice signal according to claim 2, characterized in that the step of performing a windowed fast Fourier transform on the input voice signal to obtain a Fourier-transformed amplitude spectrum specifically comprises:
inputting a voice signal;
multiplying the i-th window of the input voice signal by the normal distribution window function to obtain a windowed signal;
performing a fast Fourier transform on the windowed signal to obtain the Fourier-transformed amplitude spectrum of the i-th window of the input voice signal, whose components are indexed by j in the frequency direction.
4. The method for identifying and separating the throat sound and the gas sound of a voice signal according to claim 3, characterized in that the step of calculating the amplitude peak of the Fourier-transformed amplitude spectrum at each positive integer multiple of the fundamental frequency according to the unimodal test mask function and the measured fundamental frequency specifically comprises:
taking the dot product of the unimodal test mask function and the Fourier-transformed amplitude spectrum to obtain the amplitude peak of the Fourier-transformed amplitude spectrum at each positive integer multiple of the fundamental frequency, the amplitude peak being calculated as:
where the result is the amplitude peak of the Fourier-transformed amplitude spectrum at k times the fundamental frequency, and ⟨vector 1, vector 2⟩ denotes the dot product of vector 1 and vector 2.
5. The method for identifying and separating the throat sound and the gas sound of a voice signal according to claim 4, characterized in that the step of obtaining the envelope of the throat sound component according to the calculated amplitude peaks specifically comprises:
performing curve fitting on the calculated amplitude peaks to obtain the envelope of the throat sound component, wherein the curve fitting methods include piecewise polynomial fitting.
6. The method for identifying and separating the throat sound and the gas sound of a voice signal according to claim 5, characterized in that the step of calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peaks specifically comprises:
calculating the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peaks, the amplitude spectrum of the throat sound component being calculated as:
where the computed quantity is the j-th component, in the frequency direction, of the amplitude spectrum of the throat sound component obtained by fitting for the i-th window of the input voice signal.
7. The method for identifying and separating the throat sound and the gas sound of a voice signal according to claim 6, characterized in that the step of calculating the amplitude spectrum of the gas sound component according to the Fourier-transformed amplitude spectrum and the amplitude spectrum of the throat sound component specifically comprises:
calculating the amplitude spectrum of the gas sound component according to the Fourier-transformed amplitude spectrum and the amplitude spectrum of the throat sound component, the amplitude spectrum of the gas sound component being calculated as:
where the result is the amplitude spectrum of the gas sound component corresponding to the i-th window of the input voice signal, and |·| denotes the absolute value.
8. The method for identifying and separating the throat sound and the gas sound of a voice signal according to claim 7, characterized in that the step of obtaining the envelope of the gas sound component according to the amplitude spectrum of the gas sound component specifically comprises:
applying a Gaussian blur with standard deviation φ to the amplitude spectrum of the gas sound component to obtain the envelope of the gas sound component corresponding to the i-th window of the input voice signal.
9. A system for identifying and separating the throat sound and the gas sound of a voice signal, characterized in that the system comprises:
a mask function acquisition module, configured to obtain a unimodal test mask function and a reduction mask function;
a windowed fast Fourier transform module, configured to perform a windowed fast Fourier transform on the input voice signal to obtain a Fourier-transformed amplitude spectrum;
a fundamental frequency measuring module, configured to measure the fundamental frequency of the Fourier-transformed amplitude spectrum of each window signal;
an amplitude peak calculation module, configured to calculate the amplitude peak of the Fourier-transformed amplitude spectrum at each positive integer multiple of the fundamental frequency according to the unimodal test mask function and the measured fundamental frequency;
a throat sound component envelope calculation module, configured to obtain the envelope of the throat sound component according to the calculated amplitude peaks;
a throat sound component amplitude spectrum calculation module, configured to calculate the amplitude spectrum of the throat sound component according to the reduction mask function and the calculated amplitude peaks;
a gas sound component amplitude spectrum calculation module, configured to calculate the amplitude spectrum of the gas sound component according to the Fourier-transformed amplitude spectrum and the amplitude spectrum of the throat sound component;
and a gas sound component envelope calculation module, configured to obtain the envelope of the gas sound component according to the amplitude spectrum of the gas sound component.
10. The system for identifying and separating the throat sound and the gas sound of a voice signal according to claim 9, characterized in that the mask function acquisition module comprises:
a mask template function determining unit, configured to calculate the mask parameter η of the frequency-domain response characteristic corresponding to a normal distribution window function with standard deviation σ and thereby determine the mask template function L(x, ω, σ), where x is the frequency position of the amplitude spectrum in the frequency domain, ω is the angular velocity, and A is the amplitude;
a unimodal test mask function and reduction mask function obtaining unit, configured to obtain the unimodal test mask function and the reduction mask function from the mask template function L(x, ω, σ), the unimodal test mask function $V_k(j, f_i, \sigma)$ and the reduction mask function $U_k(j, f_i, \sigma)$ being respectively:
where s is the sampling rate, $f_i$ is the fundamental frequency of the i-th window of the Fourier-transformed amplitude spectrum, c is the window length of the fast Fourier transform, j indicates that the mask template function $L(c^{-1}sj, 2k\pi f_i, \sigma)$ corresponds to the j-th frequency component of the Fourier-transformed amplitude spectrum, k denotes the k-th multiple of the fundamental frequency of the Fourier-transformed amplitude spectrum, and k is a positive integer.
CN201710692892.XA 2017-08-14 2017-08-14 Method and system for identifying and separating throat sound and gas sound of voice signal Active CN107657962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710692892.XA CN107657962B (en) 2017-08-14 2017-08-14 Method and system for identifying and separating throat sound and gas sound of voice signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710692892.XA CN107657962B (en) 2017-08-14 2017-08-14 Method and system for identifying and separating throat sound and gas sound of voice signal

Publications (2)

Publication Number Publication Date
CN107657962A true CN107657962A (en) 2018-02-02
CN107657962B CN107657962B (en) 2020-06-12

Family

ID=61128485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710692892.XA Active CN107657962B (en) 2017-08-14 2017-08-14 Method and system for identifying and separating throat sound and gas sound of voice signal

Country Status (1)

Country Link
CN (1) CN107657962B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI665661B (en) * 2018-02-14 2019-07-11 美律實業股份有限公司 Audio processing apparatus and audio processing method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6850882B1 (en) * 2000-10-23 2005-02-01 Martin Rothenberg System for measuring velar function during speech
CN1672325A (en) * 2002-06-05 2005-09-21 索尼克焦点公司 Acoustical virtual reality engine and advanced techniques for enhancing delivered sound
CN102054480A (en) * 2009-10-29 2011-05-11 北京理工大学 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
CN105096961A (en) * 2014-05-06 2015-11-25 华为技术有限公司 Voice separation method and device
CN105679331A (en) * 2015-12-30 2016-06-15 广东工业大学 Sound-breath signal separating and synthesizing method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DALEI WU: ""the theory of compressive sensing matching pursuit considering time-domain noise with application to speech enhancement"", 《IEEE ACM TRANSACTIONS ON AUDIO,SPEECH,AND LANGUAGE PROCESSING 》 *
曹后斌: ""有色背景噪声环境下语音增强系统的设计与实现"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Also Published As

Publication number Publication date
CN107657962B (en) 2020-06-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant