CN113205827B - High-precision extraction method and device for baby voice fundamental frequency and computer equipment - Google Patents

Info

Publication number
CN113205827B
CN113205827B CN202110487291.1A CN202110487291A CN113205827B CN 113205827 B CN113205827 B CN 113205827B CN 202110487291 A CN202110487291 A CN 202110487291A CN 113205827 B CN113205827 B CN 113205827B
Authority
CN
China
Prior art keywords
infant
voice data
framed
data
fundamental frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110487291.1A
Other languages
Chinese (zh)
Other versions
CN113205827A (en
Inventor
张茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhang Ping
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202110487291.1A priority Critical patent/CN113205827B/en
Publication of CN113205827A publication Critical patent/CN113205827A/en
Application granted granted Critical
Publication of CN113205827B publication Critical patent/CN113205827B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/0308: Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a device for extracting the fundamental frequency of infant voice with high precision. The method comprises: acquiring infant voice data, and framing the infant voice data according to a preset voice framing processing strategy to obtain a plurality of frames of framed infant voice data in the time domain; performing a fast Fourier transform on the framed infant voice data in the time domain and then taking absolute values to obtain framed infant voice data in the frequency domain; dividing the framed infant voice data in the frequency domain into a symmetrical first part and second part, defining the first part or the second part as an array Z, and taking the logarithm of the array Z according to a preset calculation strategy, recorded as Zlog; and calculating the autocorrelation coefficients of Zlog, and obtaining the voice fundamental frequency of the infant voice data from the autocorrelation coefficients of Zlog. In the method and device provided by the invention, the infant voice data is framed and converted to the frequency domain, and the autocorrelation coefficients are calculated after the logarithm is taken, which greatly improves calculation precision.

Description

High-precision extraction method and device for baby voice fundamental frequency and computer equipment
Technical Field
The invention relates to the field of voice fundamental frequency detection and extraction, in particular to a method and a device for extracting a baby voice fundamental frequency with high precision and computer equipment.
Background
Speech processing of the human voice is currently aimed mostly at people who can already speak, and generally focuses on speaker recognition and speech recognition. Sounds such as infant crying, which tend to express emotion without corresponding words, are rarely studied. Common fundamental frequency identification methods include the autocorrelation coefficient method, the average amplitude difference function method and the cepstrum coefficient method, but the cepstrum coefficient method demands too many computing resources. Because an infant's fundamental frequency is high, its harmonics are few, the first formant is sometimes too high, and the fundamental is sometimes strongly attenuated, errors easily occur when the infant fundamental frequency is identified with the autocorrelation coefficient method or the average amplitude difference function method.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for extracting the fundamental frequency of infant voice with high precision, comprising: acquiring infant voice data, and framing the infant voice data according to a preset voice framing processing strategy to obtain a plurality of frames of framed infant voice data in the time domain; performing a fast Fourier transform on the framed infant voice data in the time domain and then taking absolute values to obtain framed infant voice data in the frequency domain; dividing the framed infant voice data in the frequency domain at the middle position into a symmetrical first part and second part, defining the first part or the second part as an array Z, and taking the logarithm of the array Z according to a preset calculation strategy, recorded as Zlog; and calculating the autocorrelation coefficients of Zlog according to a preset autocorrelation coefficient calculation strategy, and extracting the voice fundamental frequency of the infant voice data according to the autocorrelation coefficients of Zlog and a preset voice fundamental frequency extraction strategy.
Further, framing the infant voice data according to a preset voice framing processing strategy includes: pre-emphasis processing is carried out on the infant voice data, so that the high-frequency resolution of the infant voice data is improved; and performing framing processing on the pre-emphasized infant voice data by using a Hamming window.
Further, before the fast Fourier transform is performed on the framed infant voice data and the absolute values are taken to obtain the framed infant voice data in the frequency domain, the method further includes: calculating in advance the sine value and the cosine value of each data division in the framed infant voice data, and storing the sine value and the cosine value of each data division in an array; when the fast Fourier transform is performed on the framed infant voice data, the array is used to carry out the fast Fourier transform.
Further, the sampling frequency of the infant voice data per frame is 8820Hz, and the number of sampling points is 256.
Further, taking the logarithm of the array Z according to a preset calculation strategy and recording it as Zlog includes: storing in advance the natural logarithm results n of m = 1024:128:33664, where m = 1024:128:33664 denotes the 256 values that start at 1024, increase in steps of 128 and end at 33664, written in order as m_0, m_1, m_2, ..., m_255, and the natural logarithm results of m_0, m_1, m_2, ..., m_255 are written in order as n_0, n_1, n_2, ..., n_255; using the formula ln(Z) = ln(Z * e^t) - t, transforming ln(Z) with Z' = Z * e^t so that Z' lies within the interval [m_0, m_255]; determining the exact interval [m_q, m_(q+1)] that contains Z', where q is an integer in [0, 255]; obtaining the natural logarithm result n_q of m_q; and, according to the formula ln(Z') = n_q + (Z' - m_q)/m_q and the formula ln(Z) = ln(Z') - t, calculating ln(Z) as the result Zlog of taking the logarithm of the array Z.
Furthermore, when the absolute value is taken after the fast Fourier transform is performed on the framed infant voice data in the time domain, the absolute value is calculated using the Newton iteration method.
The invention also provides a device for extracting the fundamental frequency of infant voice with high precision, which comprises a preprocessing module, a fast Fourier transform module, a logarithm calculation module, an autocorrelation coefficient calculation module and a voice fundamental frequency calculation module, wherein: the preprocessing module is connected with the fast Fourier transform module and is used for acquiring infant voice data and framing the infant voice data according to a preset voice framing processing strategy to obtain a plurality of frames of framed infant voice data in the time domain; the fast Fourier transform module is connected with the logarithm calculation module and is used for performing a fast Fourier transform on the framed infant voice data in the time domain and then taking absolute values to obtain framed infant voice data in the frequency domain; the logarithm calculation module is connected with the autocorrelation coefficient calculation module and is used for dividing the framed infant voice data in the frequency domain at the middle position into a symmetrical first part and second part, defining the first part or the second part as an array Z, and taking the logarithm of the array Z according to a preset calculation strategy, recorded as Zlog; the autocorrelation coefficient calculation module is connected with the voice fundamental frequency calculation module and is used for calculating the autocorrelation coefficients of Zlog according to a preset autocorrelation coefficient calculation strategy; and the voice fundamental frequency calculation module is used for extracting the voice fundamental frequency of the infant voice data according to the autocorrelation coefficients of Zlog and a preset voice fundamental frequency extraction strategy.
Further, the preprocessing module comprises a pre-emphasis unit and a framing unit connected with the pre-emphasis unit, wherein: the pre-emphasis unit is used for pre-emphasizing the infant voice data to improve the high-frequency resolution of the infant voice data; the framing unit is used for performing framing processing on the infant voice data subjected to pre-emphasis processing by utilizing a Hamming window to obtain framed infant voice data on a plurality of frames of time domains.
The device further comprises a data division value storage module, the data division value storage module is connected with the fast Fourier transform module and used for storing sine values and cosine values of all data divisions in the framed infant voice data into an array in advance, and the fast Fourier transform module is also used for performing fast Fourier transform on the framed infant voice data by using the array when performing the fast Fourier transform.
The invention also provides computer equipment which comprises a memory and a processor, wherein the memory stores computer programs, and the processor realizes the steps of the high-precision extraction method of the baby voice fundamental frequency when executing the computer programs.
The method, the device and the computer equipment for extracting the fundamental frequency of infant voice with high precision provided by the invention have at least the following beneficial effects: when the voice fundamental frequency is extracted, the infant voice data is first framed to obtain stationary framed infant voice data; the framed infant voice data in the time domain is converted into framed infant voice data in the frequency domain; and the logarithm Zlog of the framed infant voice data in the frequency domain is taken, which is equivalent to refining the concentration of frequency energy once. Performing the autocorrelation calculation after this step greatly improves calculation accuracy, so the error rate of infant voice fundamental frequency extraction is reduced. The method is widely applicable to single-chip microcomputers and greatly improves the calculation speed.
Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for extracting the fundamental frequency of infant voice with high precision according to an embodiment of the present invention;
FIG. 2 is a flow chart of a logarithm calculation method in an embodiment of the present invention;
FIG. 3 is a plot of one frame of framed infant voice data in the time domain in an embodiment of the invention;
FIG. 4 is a plot obtained by calculating the autocorrelation coefficients of the data of FIG. 3 with the prior art method;
FIG. 5 is the first part of the plot of framed infant voice data in the frequency domain obtained by performing a fast Fourier transform on the data of FIG. 3, in accordance with an embodiment of the present invention;
FIG. 6 is a plot obtained by calculating the autocorrelation coefficients after taking the logarithm of FIG. 5, in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of an apparatus for extracting fundamental frequency of infant speech with high precision according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a preprocessing module according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an apparatus for extracting fundamental frequency of infant speech with high precision according to another embodiment of the present invention;
301-preprocessing module, 302-fast Fourier transform module, 303-logarithm calculation module, 304-autocorrelation coefficient calculation module, 305-voice fundamental frequency calculation module, 306-data division value storage module, 3011-pre-emphasis unit and 3012-framing unit.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In an embodiment of the present invention, as shown in fig. 1, a method for extracting a fundamental frequency of a baby voice with high precision is disclosed, specifically, the method includes the following steps:
step S101: the method comprises the steps of obtaining infant voice data, carrying out framing processing on the infant voice data according to a preset voice framing processing strategy, and obtaining framed infant voice data on a plurality of frame time domains.
Specifically, in this embodiment, the framing the infant voice data according to the preset voice framing processing policy includes: pre-emphasis processing is carried out on the infant voice data, so that the high-frequency resolution of the infant voice data is improved; the pre-emphasized infant voice data is subjected to framing processing by using a hamming window, as shown in fig. 3, a framed infant voice data map in a framed time domain is provided, wherein an abscissa represents the vibration amplitude of the voice signal, and an ordinate represents the number of sampling points.
Speech signals (i.e., the infant voice data) change over time and are broadly classified into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitudes of voiced and unvoiced signals, the vocal tract parameters and so on all vary slowly with time. Because of the inertia of the vocal organs, the voice signal can be considered approximately constant over a short time (generally 15-30 ms); that is, the voice signal has short-time stationarity. In the invention, the voice fundamental frequency extraction relies on the fast Fourier transform, which requires a stationary input signal to achieve the expected effect. The infant voice data therefore needs to be framed (with a frame length of 15-30 ms) before the fast Fourier transform: it is divided into a number of short segments (called analysis frames) that are analysed separately, and the voice fundamental frequency is extracted from each analysis frame (i.e., each frame of framed infant voice data in the invention). More specifically, framing may be implemented by weighting with a movable finite-length window. Preferably, in this embodiment, a Hamming window is used for framing; the Hamming window has greater side-lobe attenuation and a smoother low-pass characteristic, and reflects the frequency characteristics of the short-time signal more faithfully.
Further, in this embodiment, the sampling frequency of each frame of infant voice data may be 8820 Hz and the number of sampling points 256, in which case one frame of infant voice data lasts about 29 ms. FIG. 3 is the plot of one frame of infant voice data in the time domain at a sampling frequency of 8820 Hz with 256 sampling points.
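The pre-emphasis and Hamming-window framing described above can be sketched as follows. This is only an illustration under stated assumptions: the pre-emphasis coefficient 0.97 and the hop of 128 samples are not given in the text, and the function names are hypothetical. Only the sampling frequency (8820 Hz) and the frame length (256 points) come from the embodiment.

```python
import math

SAMPLE_RATE_HZ = 8820   # from the embodiment
FRAME_LEN = 256         # from the embodiment
HOP = 128               # assumption: 50% overlap, not stated in the text
PRE_EMPHASIS = 0.97     # assumption: a commonly used coefficient


def pre_emphasize(samples, alpha=PRE_EMPHASIS):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    if not samples:
        return []
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def hamming(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(samples, frame_len=FRAME_LEN, hop=HOP):
    """Cut the signal into windowed frames; a trailing partial frame is dropped."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames


# Synthetic 500 Hz tone standing in for recorded infant voice data.
signal = [math.sin(2.0 * math.pi * 500.0 * n / SAMPLE_RATE_HZ) for n in range(1024)]
frames = frame_signal(pre_emphasize(signal))
frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE_HZ  # duration of one frame in ms
```

With 256 points at 8820 Hz, `frame_ms` works out to roughly 29 ms, which matches the frame duration stated in the embodiment and falls inside the 15-30 ms range given for short-time stationarity.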
Step S102: performing a fast Fourier transform on the framed infant voice data in the time domain, and then taking absolute values to obtain the framed infant voice data in the frequency domain.
The fast Fourier transform (FFT) is a fast algorithm for the discrete Fourier transform, obtained by improving the discrete Fourier transform algorithm on the basis of its odd, even, imaginary and real properties. Using the fast Fourier transform to convert the framed infant voice data from the time domain to the frequency domain improves the operation speed.
Step S103: dividing the framed infant voice data in the frequency domain at the middle position into a symmetrical first part and second part, defining the first part or the second part as an array Z, and taking the logarithm of the array Z according to a preset calculation strategy, recorded as Zlog.
Specifically, the framed infant voice data in the frequency domain is divided at the middle position into a first part and a second part, where the first part is the first half of the sequence and the second part is the second half. Because the two parts are symmetrical, the logarithm needs to be taken of only one of them.
FIG. 5 shows the first half of the sequence, i.e., the first part, obtained after converting the data of FIG. 3 into framed infant voice data in the frequency domain, where the abscissa is frequency and the ordinate is energy amplitude. Since the sampling frequency is 8820 Hz, the first half of the sequence covers 0-4410 Hz.
Further, in this embodiment, the first part may be represented by the array Z, and then the logarithm of the first part is taken, or the second part may be represented by the array Z, and then the logarithm of the second part is taken, which is not limited in this invention.
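The symmetry that makes one half sufficient can be checked numerically. The sketch below uses a direct discrete Fourier transform on a short made-up frame (the sample values are arbitrary) to show that the magnitudes of a real-valued frame mirror about the middle, and it works out the bin spacing for the embodiment's 8820 Hz and 256 points. The function name is hypothetical.

```python
import cmath


def dft_magnitudes(frame):
    """Magnitude of every DFT bin, computed directly (O(N^2), for illustration)."""
    n = len(frame)
    mags = []
    for k in range(n):
        acc = sum(frame[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
        mags.append(abs(acc))
    return mags


# Arbitrary real-valued samples standing in for one windowed frame.
frame = [0.5, 1.0, -0.25, 2.0, 0.0, -1.5, 0.75, 0.3]
mags = dft_magnitudes(frame)
n = len(frame)

# For real input, |X[k]| == |X[N - k]| for k = 1 .. N - 1.
mirror_ok = all(abs(mags[k] - mags[n - k]) < 1e-9 for k in range(1, n))

# Values for the embodiment: 256 points sampled at 8820 Hz.
bin_width_hz = 8820.0 / 256   # frequency spacing between adjacent bins
array_z_len = 256 // 2        # number of values kept in the array Z
```

Strictly, the mirror relation pairs bin k with bin N - k, so bin 0 has no partner; taking bins 0 to 127 as the array Z is one reasonable reading of "the first half", and it spans 0 Hz up to just under 4410 Hz.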
Further, in this embodiment, the logarithm of the array Z is taken according to a preset calculation strategy and recorded as Zlog; the calculation is implemented with a sampled look-up table and the differential method. When the sampling frequency of each frame of infant voice data is 8820 Hz and the number of sampling points is 256, as shown in FIG. 2, taking the logarithm of the array Z according to the preset calculation strategy and recording it as Zlog comprises the following steps:
Step S201: storing in advance the natural logarithm results n of m = 1024:128:33664, where m = 1024:128:33664 denotes the 256 values that start at 1024, increase in steps of 128 and end at 33664, written in order as m_0, m_1, m_2, ..., m_255, and the natural logarithm results of m_0, m_1, m_2, ..., m_255 are written in order as n_0, n_1, n_2, ..., n_255;
Specifically, the step value is selected according to the number of sampling points per frame; generally 1/4 to 1/2 of the number of sampling points of one frame is selected, to prevent the pitch from jumping suddenly. In this embodiment the number of sampling points is 256 and the step value is 1/2 of that number, so the step value is set to 128.
In this step, the natural logarithms of m are stored in advance; because the subsequent calculation of Zlog is also a natural logarithm (of Z), this improves the calculation speed. In this embodiment, Z is an array holding the frequency-domain result, i.e., the amplitude values, of each frame of infant voice data after the FFT. Since the number of sampling points is 256 and only the first half or the second half of the data is taken as the array Z, the array Z contains 128 values, and calculating the natural logarithm of Z means calculating the natural logarithm of each element of the array Z.
Step S202: using the formula ln(Z) = ln(Z * e^t) - t, transforming ln(Z) with Z' = Z * e^t, where t is an integer, so that Z' lies within the interval [m_0, m_255];
Specifically, in this step, ln(Z) is rewritten as ln(Z * e^t) - t, and the value of t is chosen so that Z * e^t lies within the interval [m_0, m_255]. The subsequent calculation can then be carried out quickly using the pre-stored natural logarithms n of m.
Step S203: determining the exact interval [m_q, m_(q+1)] that contains Z', where q is an integer in [0, 255];
Step S204: obtaining the natural logarithm result n_q of m_q, and, according to the formula ln(Z') = n_q + (Z' - m_q)/m_q and the formula ln(Z) = ln(Z') - t, calculating ln(Z) as the result Zlog of taking the logarithm of the array Z.
In the embodiment, a sampling table look-up and a differential method are used, the natural logarithm results of 256 data are stored in advance, and then ln (Z) can be obtained through rapid calculation according to the natural logarithm results of the 256 data stored in advance.
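Steps S201 to S204 can be sketched as below. The table values and both formulas come from the text; the way t is searched (repeated multiplication or division by e) and the function name `table_ln` are assumptions made for illustration, and a real single-chip implementation would likely avoid `math.log` and floating-point `e` at run time.

```python
import math

# Step S201: m = 1024:128:33664 (256 values) and their natural logarithms n.
M_TABLE = list(range(1024, 33664 + 1, 128))
N_TABLE = [math.log(m) for m in M_TABLE]


def table_ln(z):
    """Approximate ln(z) for z > 0 using the look-up table and a linear term."""
    if z <= 0:
        raise ValueError("z must be positive")
    # Step S202: pick an integer t so that z * e^t falls inside [m_0, m_255].
    t = 0
    z_scaled = float(z)
    while z_scaled < M_TABLE[0]:
        z_scaled *= math.e
        t += 1
    while z_scaled > M_TABLE[-1]:
        z_scaled /= math.e
        t -= 1
    # Step S203: index q of the table interval that contains the scaled value.
    q = min(int((z_scaled - M_TABLE[0]) // 128), len(M_TABLE) - 1)
    m_q = M_TABLE[q]
    # Step S204: ln(Z') = n_q + (Z' - m_q) / m_q, then ln(Z) = ln(Z') - t.
    return N_TABLE[q] + (z_scaled - m_q) / m_q - t


samples = [0.5, 1.0, 10.0, 1000.0, 5000.0, 1.0e6]
errors = [abs(table_ln(z) - math.log(z)) for z in samples]
```

The linear term is the first-order expansion of ln around m_q, so its error is largest near the bottom of the table, where the relative step 128/1024 is largest; there it is bounded by roughly (0.125^2)/2, or about 0.008. Because the table spans a ratio of more than e between m_0 and m_255, an integer t that lands inside the interval always exists.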
Step S104: calculating the autocorrelation coefficients of Zlog according to a preset autocorrelation coefficient calculation strategy, and extracting the voice fundamental frequency of the infant voice data according to the autocorrelation coefficients of Zlog and a preset voice fundamental frequency extraction strategy.
Specifically, the autocorrelation coefficient is the correlation coefficient between a speech segment shifted by n and the original speech array (i.e., the correlation between the sequence and a shifted copy of itself). In this embodiment, the preset autocorrelation coefficient calculation strategy may obtain the autocorrelation coefficient r_k of Zlog according to the autocovariance formula of the sequence:
[Formula image BDA0003050996670000081: autocovariance of the sequence; not reproduced in this text]
and the autocorrelation coefficient formula:
[Formula image BDA0003050996670000082: autocorrelation coefficient; not reproduced in this text]
where the variable X_t denotes a time sequence, x_t denotes the t-th point in the time sequence, t = 1, 2, 3, ..., N, N denotes the length of the sequence X_t, the mean is μ = E(X_t), the variance of the sequence is σ^2 = D(X_t) = E((x_t - μ)^2), and k is the number of lags,
[Formula image BDA0003050996670000083: not reproduced in this text]
In the present embodiment, N is 128.
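Since the formula images are not available in this text, the sketch below uses the standard sample estimates, which is an assumption rather than a reproduction of the patent's formulas: autocovariance c_k = (1/N) * sum over t of (x_t - μ)(x_(t+k) - μ), and autocorrelation coefficient r_k = c_k / c_0. The function name is hypothetical.

```python
def autocorrelation_coefficients(x, max_lag):
    """Return [r_0, r_1, ..., r_max_lag] for the sequence x.

    Assumed estimator (standard, not taken from the formula images):
    c_k = (1/N) * sum_{t=0}^{N-k-1} (x[t] - mean) * (x[t+k] - mean)
    r_k = c_k / c_0
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n
    if c0 == 0.0:
        raise ValueError("a constant sequence has no defined autocorrelation")
    coeffs = []
    for k in range(max_lag + 1):
        ck = sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        coeffs.append(ck / c0)
    return coeffs


# A sequence that repeats every 2 points: r_2 is large, r_1 is negative.
seq = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
r = autocorrelation_coefficients(seq, 2)
```

For this toy sequence the coefficient at lag 2 is 0.75 and at lag 1 is -0.875, which illustrates how a peak in r_k marks the repetition interval. Applied to Zlog with N = 128, the repetition interval is a spacing between harmonics measured in frequency bins.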
Further, extracting the voice fundamental frequency of the infant voice data according to the autocorrelation coefficients of Zlog and a preset voice fundamental frequency extraction strategy means the following: in the plot (shown in FIG. 6) obtained by calculating the autocorrelation coefficients after taking the logarithm of one frame of infant voice data in the frequency domain, the spacing between the harmonics is taken as the voice fundamental frequency. FIG. 3 shows the corresponding frame of infant voice data in the time domain, at a sampling frequency of 8820 Hz with 256 sampling points.
FIG. 4 is the plot obtained by directly calculating the autocorrelation coefficients r(k) = r(t, t + k) of the time-domain frame of infant voice data in FIG. 3, using a fundamental frequency extraction method common in the prior art. The voice fundamental frequency calculated from the maximum is F0 = 1/T = 1/(3.968 ms) = 252 Hz. Comparing FIG. 4 with FIGS. 5 and 6 shows that the result calculated from FIG. 4 is clearly erroneous.
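The arithmetic behind the prior-art figure can be reproduced as a quick check. The lag of 35 samples is an inference, not a value given in the text: it is the whole number of samples at 8820 Hz that corresponds to the stated period of 3.968 ms.

```python
SAMPLE_RATE_HZ = 8820  # from the embodiment


def lag_to_f0(lag_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a time-domain autocorrelation lag (in samples) to a frequency in Hz."""
    period_s = lag_samples / sample_rate_hz
    return 1.0 / period_s


lag = 35  # assumption: 3.968 ms * 8820 Hz is about 35 samples
period_ms = 1000.0 * lag / SAMPLE_RATE_HZ
f0_hz = lag_to_f0(lag)
```

The result, 252 Hz, is close to half of the roughly 500 Hz read from the harmonics in FIG. 5. One possible reading, offered here as an interpretation rather than something the source states, is that the time-domain autocorrelation peaked at about twice the true pitch period.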
FIG. 5 shows the first part (i.e., the first half of the sequence) of the frequency-domain plot obtained by performing a fast Fourier transform on the time-domain frame of infant voice data in FIG. 3 and then taking absolute values. As can be seen from the harmonics, the voice fundamental frequency in FIG. 5 should be around 500 Hz.
FIG. 6 shows the plot obtained by calculating the autocorrelation coefficients after taking the logarithm of FIG. 5. The fundamental frequency spacing is more pronounced in this plot, and FIG. 6 makes it clear that 500 Hz is the true fundamental frequency.
In summary, in the method for extracting a voice fundamental frequency provided in this embodiment, the time-domain framed infant voice data is first converted into frequency-domain framed infant voice data, and the logarithm Zlog of the frequency-domain framed infant voice data is taken, which is equivalent to refining the concentration of frequency energy once. Performing the autocorrelation calculation after this step greatly improves calculation accuracy, so the error rate of infant voice fundamental frequency extraction is reduced.
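The chain from one frame to a fundamental frequency (magnitude spectrum, logarithm, autocorrelation, harmonic spacing) can be sketched end to end on a synthetic frame. Several things here are assumptions for illustration: a direct DFT and `math.log` replace the table-based FFT and logarithm, the magnitudes are floored at 1% of the peak to avoid ln(0), the lag search is limited to 200-1000 Hz, and the test tone places its fundamental exactly on bin 16 so the expected answer is known.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8820  # from the embodiment
FRAME_LEN = 256        # from the embodiment


def hamming(length):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def half_magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. N/2 - 1 (the array Z), by direct DFT."""
    n = len(frame)
    return [abs(sum(frame[i] * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i in range(n)))
            for k in range(n // 2)]


def log_spectrum(mags, rel_floor=1e-2):
    """Natural log of the magnitudes; the relative floor is an assumption."""
    peak = max(mags)
    if peak <= 0.0:
        raise ValueError("silent frame: no spectrum to analyse")
    floor = peak * rel_floor
    return [math.log(max(m, floor)) for m in mags]


def lagged_product_sum(dev, k):
    """Unnormalised autocovariance; the common factor 1/(N*c_0) does not move the peak."""
    return sum(dev[t] * dev[t + k] for t in range(len(dev) - k))


def estimate_f0(frame, f_min_hz=200.0, f_max_hz=1000.0):
    bin_hz = SAMPLE_RATE_HZ / len(frame)
    zlog = log_spectrum(half_magnitude_spectrum(frame))
    mean = sum(zlog) / len(zlog)
    dev = [v - mean for v in zlog]
    lo = max(1, math.ceil(f_min_hz / bin_hz))
    hi = min(len(dev) - 1, math.floor(f_max_hz / bin_hz))
    best_lag = max(range(lo, hi + 1), key=lambda k: lagged_product_sum(dev, k))
    return best_lag * bin_hz  # harmonic spacing in bins -> Hz


# Synthetic frame: 7 equal harmonics of a fundamental sitting on bin 16.
window = hamming(FRAME_LEN)
frame = [window[n] * sum(math.cos(2.0 * math.pi * 16 * h * n / FRAME_LEN)
                         for h in range(1, 8))
         for n in range(FRAME_LEN)]
f0_hz = estimate_f0(frame)
```

Because the lag is counted in frequency bins, the resolution of this sketch is one bin, about 34.5 Hz at these settings; the source does not say whether or how the peak position is refined beyond that.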
Furthermore, the Zlog is calculated by adopting a sampling table look-up method and a differential method, so that the calculation speed of the Zlog can be effectively improved.
In another embodiment of the present invention, before the fast Fourier transform is performed on the framed infant voice data to obtain the framed infant voice data in the frequency domain, the method further includes:
calculating in advance the sine value and the cosine value of each data division in the framed infant voice data, and storing the sine value and the cosine value of each data division in an array; when the fast Fourier transform is performed on the framed infant voice data, the array is used to carry out the fast Fourier transform.
Specifically, taking the framed infant voice data as an N-point sequence as an example, each data item in the framed infant voice data can be expressed by the following formula:
[Formula image BDA0003050996670000101: not reproduced in this text]
where k = 0, ..., N-1, which is transformed according to Euler's formula into:
[Formula image BDA0003050996670000102: not reproduced in this text]
Thus each data item in the framed infant voice data involves the calculation of sine and cosine components.
Since the sine value and the cosine value of each data division are used in the fast fourier transform, in this embodiment, the sine value and the cosine value of each data division are calculated and stored in advance, and the existing data is directly called in the fast fourier transform, thereby further improving the calculation speed.
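A minimal sketch of the idea follows, assuming an iterative radix-2 FFT (the text does not specify which FFT variant is used). The sine and cosine values are computed once for the frame length and then only indexed inside the transform. All function names are hypothetical; `direct_dft` is included only as a reference for checking the result.

```python
import math


def make_twiddle_tables(n):
    """Precompute cos(2*pi*k/n) and sin(2*pi*k/n) for k = 0 .. n/2 - 1."""
    cos_t = [math.cos(2.0 * math.pi * k / n) for k in range(n // 2)]
    sin_t = [math.sin(2.0 * math.pi * k / n) for k in range(n // 2)]
    return cos_t, sin_t


def fft_with_tables(x, cos_t, sin_t):
    """Radix-2 FFT of a real sequence whose length is a power of two."""
    n = len(x)
    re = [float(v) for v in x]
    im = [0.0] * n
    # Bit-reversal permutation (imaginary parts are all zero at this point).
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
    # Butterflies; the twiddle exp(-j*2*pi*k/size) is read from the tables.
    size = 2
    while size <= n:
        half, step = size // 2, n // size
        for start in range(0, n, size):
            for k in range(half):
                c, s = cos_t[k * step], sin_t[k * step]
                a, b = start + k, start + k + half
                tr = re[b] * c + im[b] * s
                ti = im[b] * c - re[b] * s
                re[b], im[b] = re[a] - tr, im[a] - ti
                re[a], im[a] = re[a] + tr, im[a] + ti
        size *= 2
    return re, im


def direct_dft(x):
    """Reference DFT that recomputes every sine and cosine (O(N^2))."""
    n = len(x)
    return [(sum(x[i] * math.cos(2.0 * math.pi * k * i / n) for i in range(n)),
             -sum(x[i] * math.sin(2.0 * math.pi * k * i / n) for i in range(n)))
            for k in range(n)]


COS_256, SIN_256 = make_twiddle_tables(256)  # built once, reused for every frame
test_frame = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(256)]
fft_re, fft_im = fft_with_tables(test_frame, COS_256, SIN_256)
reference = direct_dft(test_frame)
```

For a 256-point frame the two tables hold 128 values each, so the trigonometric functions are never called inside the transform loop.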
In another embodiment of the present invention, when the absolute value is taken after the fast Fourier transform is performed on the framed infant voice data in the time domain, the absolute value is calculated using Newton's iteration method (also called the Newton-Raphson method). Specifically, calculating the absolute value requires taking a square root, and using Newton's iteration method for this effectively improves the calculation precision.
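The square root in the magnitude calculation can be sketched with the usual Newton update y <- (y + s / y) / 2. The starting value, the stopping tolerance and the iteration cap below are assumptions chosen for illustration; the text does not give them.

```python
def newton_sqrt(value, tolerance=1e-12, max_iter=60):
    """Square root by Newton's iteration on f(y) = y*y - value."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return 0.0
    y = value if value >= 1.0 else 1.0  # assumption: a simple starting guess
    for _ in range(max_iter):
        y_next = 0.5 * (y + value / y)
        if abs(y_next - y) <= tolerance * y_next:
            return y_next
        y = y_next
    return y


def magnitude(re, im):
    """Absolute value of the complex FFT output re + j*im."""
    return newton_sqrt(re * re + im * im)


example = magnitude(3.0, 4.0)
```

Each FFT bin supplies a real and an imaginary part, and `magnitude` returns the absolute value that the method then passes on to the logarithm step.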
The invention also provides a device for extracting the baby voice fundamental frequency with high precision, as shown in fig. 7, the device comprises: a preprocessing module 301, a fast fourier transform module 302, a logarithm calculation module 303, an autocorrelation coefficient calculation module 304, and a fundamental speech frequency calculation module 305, wherein:
the preprocessing module 301 is connected to the fast fourier transform module 302, and is configured to acquire infant voice data, perform framing processing on the infant voice data according to a preset voice framing processing strategy, and acquire framed infant voice data in a plurality of frame time domains;
the fast Fourier transform module 302 is connected with the logarithm calculation module 303 and is used for performing fast Fourier transform on the framed infant voice data in the time domain and then taking an absolute value to obtain framed infant voice data in the frequency domain;
the logarithm calculation module 303 is connected to the autocorrelation coefficient calculation module 304, and is configured to divide the framed infant speech data in the frequency domain into a symmetrical first part and a symmetrical second part from the middle position, define the first part or the second part as an array Z, and log the array Z according to a preset calculation strategy, and record the log as Zlog;
an autocorrelation coefficient calculation module 304, connected to the speech fundamental frequency calculation module 305, for calculating an autocorrelation coefficient of Zlog according to a preset autocorrelation coefficient calculation strategy;
and the speech fundamental frequency calculation module 305 is configured to extract a speech fundamental frequency of the infant speech data according to the autocorrelation coefficient of Zlog and a preset speech fundamental frequency extraction strategy.
Further, as shown in fig. 8, the pre-processing module 301 includes a pre-emphasis unit 3011 and a framing unit 3012 connected to the pre-emphasis unit 3011, where:
the pre-emphasis unit 3011 is configured to perform pre-emphasis processing on the infant speech data to improve the high-frequency resolution of the infant speech data;
the framing unit 3012 is configured to perform framing processing on the pre-emphasized infant speech data by using a hamming window to obtain framed infant speech data in a plurality of frames of time domains.
In another embodiment of the present invention, as shown in fig. 9, the apparatus further includes a data division value storage module 306. The data division value storage module 306 is connected to the fast Fourier transform module 302 and is configured to store in advance, as an array, the sine values and cosine values of each data division in the framed infant voice data; the fast Fourier transform module 302 is further configured to use the array when performing the fast Fourier transform on the framed infant voice data.
The invention also provides computer equipment which comprises a memory and a processor, wherein the memory stores computer programs, and the processor realizes the steps of the high-precision extraction method of the baby voice fundamental frequency when executing the computer programs.
In summary, the speech fundamental frequency extraction method of this embodiment first converts the framed infant speech data from the time domain to the frequency domain and then takes its logarithm, Zlog. Taking the logarithm acts as a first concentration of the frequency energy, so that the subsequent autocorrelation calculation is considerably more accurate, which reduces the error rate when extracting the infant speech fundamental frequency. Furthermore, Zlog is calculated by a sampled table look-up combined with a differential method, which effectively increases the speed of calculating Zlog. The sine values and cosine values of each data division in the framed infant voice data are also stored in advance as an array, and using this pre-stored array during the fast Fourier transform greatly reduces the amount of computation and increases the calculation speed. High-precision calculation of the speech fundamental frequency is therefore achieved with a low amount of computation.
The terms and expressions used in the specification of the present invention have been set forth for illustrative purposes only and are not meant to be limiting. The terms "first" and "second" used herein in the claims and the description of the present invention are for the purpose of convenience of distinction, have no special meaning, and are not intended to limit the present invention. It will be appreciated by those skilled in the art that changes could be made to the details of the above-described embodiments without departing from the underlying principles thereof. The scope of the invention is, therefore, indicated by the appended claims, in which all terms are intended to be interpreted in their broadest reasonable sense unless otherwise indicated.

Claims (10)

1. A method for extracting a fundamental frequency of infant voice with high precision is characterized by comprising the following steps:
acquiring infant voice data, and performing framing processing on the infant voice data according to a preset voice framing processing strategy to obtain framed infant voice data on a plurality of frame time domains;
performing fast Fourier transform on the framed infant voice data in the time domain, and then taking an absolute value to obtain framed infant voice data in a frequency domain;
dividing the framed infant voice data on the frequency domain into a symmetrical first part and a symmetrical second part from the middle position, wherein the first part is a first half part sequence of the framed infant voice data on the frequency domain, the second part is a second half part sequence of the framed infant voice data on the frequency domain, the first part or the second part is defined as an array Z, and the array Z is logarithmized according to a preset calculation strategy and is recorded as ZLog;
calculating the autocorrelation coefficient of the Zlog according to a preset autocorrelation coefficient calculation strategy, obtaining an autocorrelation coefficient-frequency map according to the autocorrelation coefficient of the Zlog, and extracting the speech fundamental frequency of the infant speech data according to the harmonic distance in the autocorrelation coefficient-frequency map.
2. The method for extracting the infant speech fundamental frequency with high precision according to claim 1, wherein the framing the infant speech data according to a preset speech framing processing strategy comprises:
pre-emphasis processing is carried out on the infant voice data, and the high-frequency resolution of the infant voice data is improved;
and performing framing processing on the infant voice data subjected to pre-emphasis processing by utilizing a Hamming window.
3. The method for extracting the fundamental frequency of infant speech with high precision according to claim 1, wherein before performing the fast Fourier transform on the framed infant speech data and taking absolute values to obtain the framed infant speech data in the frequency domain, the method further comprises:
and pre-calculating sine values and cosine values of all data indexes in the framed infant voice data, storing the sine values and the cosine values of all data indexes as an array, and performing fast Fourier transform by using the array when performing fast Fourier transform on the framed infant voice data.
4. The method for extracting the fundamental frequency of infant speech with high accuracy according to claim 1, wherein the sampling frequency of each frame of the infant speech data is 8820 Hz, and the number of sampling points is 256.
5. The method for extracting fundamental frequency of infant speech with high precision as claimed in claim 4, wherein said logarithmizing the array Z according to a preset calculation strategy, and recording as ZLog comprises:
pre-storing 256 values $m = 1024{:}128{:}33664$, that is, $m$ starts at 1024, advances in steps of 128 and ends at 33664; the values are denoted in order as $m_0, m_1, m_2, \ldots, m_{255}$, and their natural logarithms are denoted in order as $n_0, n_1, n_2, \ldots, n_{255}$;
using the formula $\ln(Z) = \ln(Z \cdot e^{t}) - t$ to transform $\ln(Z)$ such that $Z' = Z \cdot e^{t}$ lies within the interval $[m_0, m_{255}]$, where $t$ is an integer;
determining the precise interval $[m_q, m_{q+1}]$ in which $Z'$ lies, where $q$ is an integer in $[0, 255]$;
obtaining the natural logarithm $n_q$ of $m_q$, and calculating $\ln(Z)$ according to the formula $\ln(Z') = n_q + (Z' - m_q)/m_q$ and the formula $\ln(Z) = \ln(Z') - t$, as the result Zlog of taking the logarithm of the array Z.
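The table look-up logarithm of claim 5 can be sketched as follows, writing Z' for the rescaled value Z·e^t. The table of 256 values from 1024 to 33664 in steps of 128 and the first-order correction (Z' − m_q)/m_q follow the claim; how the integer t is found is not stated, so the repeated multiplication or division by e below is an assumption, as are the names `M_TABLE`, `N_TABLE` and `table_ln`.

```python
import math

M_START, M_STEP, M_COUNT = 1024, 128, 256
M_TABLE = [M_START + M_STEP * q for q in range(M_COUNT)]  # m_0 .. m_255 (1024 .. 33664)
N_TABLE = [math.log(m) for m in M_TABLE]                  # n_q = ln(m_q), computed once

def table_ln(z):
    """Approximate ln(z) for z > 0 with a table look-up plus a first-order correction."""
    if z <= 0.0:
        raise ValueError("z must be positive")
    # Find an integer t with Z' = z * e**t inside [m_0, m_255] (search method assumed).
    t = 0
    z_scaled = float(z)
    while z_scaled < M_TABLE[0]:
        z_scaled *= math.e
        t += 1
    while z_scaled > M_TABLE[-1]:
        z_scaled /= math.e
        t -= 1
    # Locate the table interval [m_q, m_q+1] that contains Z'.
    q = min(int((z_scaled - M_START) // M_STEP), M_COUNT - 1)
    # ln(Z') ~ n_q + (Z' - m_q) / m_q, then ln(Z) = ln(Z') - t.
    return N_TABLE[q] + (z_scaled - M_TABLE[q]) / M_TABLE[q] - t

# Usage: compare against math.log for a few magnitudes.
errors = [abs(table_ln(v) - math.log(v)) for v in (0.5, 1.0, 1000.0, 50000.0)]
```

Because the correction is a first-order approximation, the error in this sketch is largest near the low end of the table, where the step of 128 is widest relative to m_q, and it vanishes at the table entries themselves.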
6. The method for extracting the infant speech fundamental frequency with high precision according to claim 1, wherein the absolute value is calculated by using a Newton iteration method after the fast Fourier transform is performed on the framed infant speech data in the time domain.
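One way to read the Newton iteration of claim 6 is as a square-root routine for the magnitude of each complex spectral value. The claim does not give the iteration formula, starting value or stopping rule, so the sketch below, including the names `newton_sqrt` and `magnitude`, the tolerance and the iteration cap, is an assumption.

```python
def newton_sqrt(s, tol=1e-12, max_iter=60):
    """Square root by Newton's method: x <- (x + s / x) / 2."""
    if s < 0.0:
        raise ValueError("s must be non-negative")
    if s == 0.0:
        return 0.0
    x = s if s >= 1.0 else 1.0  # starting guess at or above the true root
    for _ in range(max_iter):
        x_next = 0.5 * (x + s / x)
        if abs(x_next - x) <= tol * x_next:
            return x_next
        x = x_next
    return x

def magnitude(re, im):
    """Absolute value of the complex number re + i*im."""
    return newton_sqrt(re * re + im * im)

# Usage: magnitude of the spectral value 3 + 4i.
value = magnitude(3.0, 4.0)
```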
7. An apparatus for extracting fundamental frequency of baby voice with high precision, the apparatus comprising: the device comprises a preprocessing module, a fast Fourier transform module, a logarithm calculation module, an autocorrelation coefficient calculation module and a voice fundamental frequency calculation module, wherein:
the preprocessing module is connected with the fast Fourier transform module and used for acquiring infant voice data, and framing the infant voice data according to a preset voice framing processing strategy to acquire framed infant voice data on a plurality of frames of time domains;
the fast Fourier transform module is connected with the logarithm calculation module and is used for carrying out fast Fourier transform on the framed infant voice data in the time domain to obtain framed infant voice data in the frequency domain;
the logarithm calculation module is connected with the autocorrelation coefficient calculation module and is used for dividing the framed infant voice data on the frequency domain into a symmetrical first part and a symmetrical second part from the middle position, wherein the first part is a first half part sequence of the framed infant voice data on the frequency domain, the second part is a second half part sequence of the framed infant voice data on the frequency domain, the first part or the second part is defined as an array Z, and the logarithm of the array Z is taken according to a preset calculation strategy and is recorded as Zlog;
the autocorrelation coefficient calculation module is connected with the voice fundamental frequency calculation module and used for calculating the autocorrelation coefficient of the Zlog according to a preset autocorrelation coefficient calculation strategy;
the voice fundamental frequency calculation module is used for obtaining an autocorrelation coefficient-frequency map according to the autocorrelation coefficient of the Zlog and extracting the voice fundamental frequency of the infant voice data according to the harmonic distance in the autocorrelation coefficient-frequency map.
8. The device for extracting infant speech fundamental frequency with high precision according to claim 7, wherein the preprocessing module comprises a pre-emphasis unit and a framing unit connected with the pre-emphasis unit, wherein:
the pre-emphasis unit is used for pre-emphasizing the infant voice data to improve the high-frequency resolution of the infant voice data;
the framing unit is used for performing framing processing on the infant voice data subjected to pre-emphasis processing by utilizing a Hamming window to obtain framed infant voice data on a plurality of frame time domains.
9. The device for extracting the infant speech fundamental frequency with high precision according to claim 7, further comprising a data division value storage module, wherein the data division value storage module is connected with the fast Fourier transform module and is used for storing sine values and cosine values of each data division in the framed infant speech data as an array in advance, and the fast Fourier transform module is further used for performing fast Fourier transform by using the array when performing the fast Fourier transform on the framed infant speech data.
10. Computer device, characterized in that it comprises a memory and a processor, wherein said memory stores a computer program, and said processor executes said computer program to implement the steps of the method for extracting baby speech fundamental frequency with high precision as claimed in any one of claims 1 to 6.
CN202110487291.1A 2021-05-05 2021-05-05 High-precision extraction method and device for baby voice fundamental frequency and computer equipment Active CN113205827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110487291.1A CN113205827B (en) 2021-05-05 2021-05-05 High-precision extraction method and device for baby voice fundamental frequency and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110487291.1A CN113205827B (en) 2021-05-05 2021-05-05 High-precision extraction method and device for baby voice fundamental frequency and computer equipment

Publications (2)

Publication Number Publication Date
CN113205827A (en) 2021-08-03
CN113205827B (en) 2022-02-15

Family

ID=77029887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110487291.1A Active CN113205827B (en) 2021-05-05 2021-05-05 High-precision extraction method and device for baby voice fundamental frequency and computer equipment

Country Status (1)

Country Link
CN (1) CN113205827B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178897A (en) * 2007-12-05 2008-05-14 浙江大学 Speaking man recognizing method using base frequency envelope to eliminate emotion voice
CN107833581A (en) * 2017-10-20 2018-03-23 广州酷狗计算机科技有限公司 A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound
WO2018138543A1 (en) * 2017-01-24 2018-08-02 Hua Kanru Probabilistic method for fundamental frequency estimation
CN109087627A (en) * 2018-10-16 2018-12-25 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN110379438A (en) * 2019-07-24 2019-10-25 山东省计算中心(国家超级计算济南中心) A kind of voice signal fundamental detection and extracting method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
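Research on Speech Fundamental Frequency Detection Algorithms in Noisy Environments (噪声环境下的语音基频检测算法研究); Wang Xiaobiao (王小标); China Master's Theses Full-text Database (Information Science and Technology Series); 20210115; I136-248 *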

Also Published As

Publication number Publication date
CN113205827A (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN106486131B (en) A kind of method and device of speech de-noising
CN109243491B (en) Method, system and storage medium for emotion recognition of speech in frequency spectrum
US4038503A (en) Speech recognition apparatus
US8280724B2 (en) Speech synthesis using complex spectral modeling
CN110459241B (en) Method and system for extracting voice features
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
CN110880329B (en) Audio identification method and equipment and storage medium
US20050143997A1 (en) Method and apparatus using spectral addition for speaker recognition
US8566084B2 (en) Speech processing based on time series of maximum values of cross-power spectrum phase between two consecutive speech frames
Kesarkar et al. Feature extraction for speech recognition
CN110390947B (en) Method, system, device and storage medium for determining sound source position
CN111128213A (en) Noise suppression method and system for processing in different frequency bands
US20100094622A1 (en) Feature normalization for speech and audio processing
US9076446B2 (en) Method and apparatus for robust speaker and speech recognition
CN112599148A (en) Voice recognition method and device
Hsu et al. Robust voice activity detection algorithm based on feature of frequency modulation of harmonics and its DSP implementation
CN113205827B (en) High-precision extraction method and device for baby voice fundamental frequency and computer equipment
CN112309425A (en) Sound tone changing method, electronic equipment and computer readable storage medium
CN110875037A (en) Voice data processing method and device and electronic equipment
CN108074588B (en) Pitch calculation method and pitch calculation device
CN112397087B (en) Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal
CN115985332A (en) Voice tone changing method, storage medium and electronic equipment
Ito et al. Sinusoidal modeling for nonstationary voiced speech based on a local vector transform
Mallidi et al. Robust speaker recognition using spectro-temporal autoregressive models.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240509

Address after: 561200, Group 48, Beiguan Village, Chengguan Town, Zhenning Buyei and Miao Autonomous County, Anshun City, Guizhou Province

Patentee after: Zhang Ping

Country or region after: China

Address before: 561299 group 5, Beiguan village, Chengguan Town, Zhenning Buyei and Miao Autonomous County, Anshun City, Guizhou Province

Patentee before: Zhang Qian

Country or region before: China
