CN112086104B - Method and device for obtaining fundamental frequency of audio signal, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112086104B
CN112086104B
Authority
CN
China
Prior art keywords
domain signal
audio data
signal frame
time domain
interval
Prior art date
Legal status
Active
Application number
CN202010829745.4A
Other languages
Chinese (zh)
Other versions
CN112086104A (en)
Inventor
方桂萍
肖全之
闫玉凤
Current Assignee
Zhuhai Jieli Technology Co Ltd
Original Assignee
Zhuhai Jieli Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Jieli Technology Co Ltd
Priority to CN202010829745.4A
Publication of CN112086104A
Application granted
Publication of CN112086104B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/06 — Speech or voice analysis techniques in which the extracted parameters are correlation coefficients

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The application relates to a method, an apparatus, an electronic device and a storage medium for obtaining the fundamental frequency of an audio signal. The method comprises: framing the time domain audio signal to obtain a plurality of signal frames; low-pass filtering each signal frame and downsampling it to obtain a plurality of first audio points; taking the first audio points that fall within an audio point selection interval as first target audio points, and determining the autocorrelation error of each first target audio point to form a curve; determining a first interval corresponding to the smallest valley of the curve and a second interval corresponding to the second-smallest valley; upsampling each low-pass-filtered signal frame to obtain a plurality of second audio points; taking the second audio points located in the first interval and the second interval as second target audio points, and determining the autocorrelation error of each second target audio point; taking the frequency corresponding to the second target audio point with the smallest autocorrelation error as the initial fundamental frequency of each signal frame; and determining the fundamental frequency of each signal frame from the initial fundamental frequency. The method and apparatus can improve the accuracy with which the fundamental frequency is obtained.

Description

Method and device for obtaining fundamental frequency of audio signal, electronic equipment and storage medium
Technical Field
The present application relates to the field of audio technologies, and in particular, to a method and an apparatus for obtaining a fundamental frequency of an audio signal, an electronic device, and a storage medium.
Background
With the development of audio processing technology, speech synthesis has become an important part of daily life and is widely applied in products such as live-streaming sound cards and multi-mode karaoke microphones. For speech synthesis, because the pitch of speech corresponds to its fundamental frequency, accurately acquiring the fundamental frequency bears directly on the accuracy of the synthesized speech.
At present, the fundamental frequency of an audio signal is generally obtained through a trained neural network. However, current fundamental frequency acquisition methods search for the fundamental frequency over too wide a range, so the fundamental frequency obtained is of low precision.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, an electronic device and a storage medium for obtaining a fundamental frequency of an audio signal.
A method of fundamental frequency acquisition of an audio signal, the method comprising:
framing a time domain audio signal to obtain a plurality of time domain signal frames of the time domain audio signal;
performing low-pass filtering on each time domain signal frame, and performing down-sampling on each time domain signal frame subjected to low-pass filtering to obtain a plurality of first audio data points contained in each time domain signal frame subjected to low-pass filtering;
taking an audio data point positioned in a preset audio data point selection interval in the plurality of first audio data points as a first target audio data point to obtain a plurality of first target audio data points, and determining the autocorrelation error of each first target audio data point to form an autocorrelation error curve;
determining a first time interval corresponding to the smallest valley of the autocorrelation error curve and a second time interval corresponding to the second-smallest valley;
performing upsampling on each time domain signal frame subjected to low-pass filtering to obtain a plurality of second audio data points contained in each time domain signal frame subjected to low-pass filtering;
taking the audio data points in the first time interval and the second time interval as second target audio data points to obtain a plurality of second target audio data points, and determining the autocorrelation error of each second target audio data point;
taking the audio frequency corresponding to the second target audio data point with the minimum autocorrelation error as the initial fundamental frequency of each time domain signal frame;
determining the fundamental frequency of each time domain signal frame according to the initial fundamental frequency, which comprises: extracting, from the initial fundamental frequencies, the initial fundamental frequencies corresponding to a preset number of time domain signal frames as target fundamental frequencies corresponding to those frames; determining a current time domain signal frame; if the current time domain signal frame is the first frame, taking the initial fundamental frequency corresponding to the current time domain signal frame as the fundamental frequency of the current time domain signal frame; if the current time domain signal frame is not the first frame, acquiring the initial fundamental frequencies corresponding to the preset number of time domain signal frames preceding the current time domain signal frame; and taking the median of the initial fundamental frequency corresponding to the current time domain signal frame and the initial fundamental frequencies corresponding to the preset number of preceding time domain signal frames as the fundamental frequency of the current time domain signal frame.
In one embodiment, determining the autocorrelation error of each first target audio data point comprises: acquiring the interval length of a preset first reference time interval; determining a second reference time interval corresponding to each first target audio data point based on the interval length; and obtaining the autocorrelation error of each first target audio data point according to the first reference time interval and the second reference time interval.
In one embodiment, obtaining the autocorrelation error of each first target audio data point according to the first reference time interval and the second reference time interval includes: taking the audio data points among the plurality of first audio data points that lie in the first reference time interval as first reference frequency points, and taking their frequencies as first reference frequencies; taking the audio data points that lie in the second reference time interval as second reference frequency points, and taking their frequencies as second reference frequencies; computing the squared error between each pair of first and second reference frequencies to obtain a plurality of squared error values; and summing the squared error values to obtain the autocorrelation error of each first target audio data point.
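As an illustrative sketch (not part of the patent), the "autocorrelation error" described above amounts to a sum of squared differences between a reference window and the same window shifted by a candidate lag. The sketch below assumes discrete sample indices rather than the patent's time intervals; all names are hypothetical.

```python
import math

def autocorrelation_error(samples, lag, window_len):
    """Sum of squared differences between a reference window and the same
    window shifted by `lag` samples (the candidate pitch period)."""
    return sum((samples[n] - samples[n + lag]) ** 2 for n in range(window_len))

# For a periodic signal the error dips toward zero at the true period.
period = 20
signal = [math.sin(2 * math.pi * n / period) for n in range(200)]
errors = {lag: autocorrelation_error(signal, lag, window_len=100)
          for lag in range(2, 36)}
best_lag = min(errors, key=errors.get)  # -> 20, one full period
```

The lag with the smallest error corresponds to the pitch period, so its reciprocal (times the sampling rate) gives the fundamental frequency.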
In one embodiment, determining the first time interval corresponding to the smallest valley of the autocorrelation error curve and the second time interval corresponding to the second-smallest valley includes: determining a first time value corresponding to the smallest valley and a second time value corresponding to the second-smallest valley; acquiring a preset duration interval range; and obtaining the first time interval according to the first time value and the duration interval range, and the second time interval according to the second time value and the duration interval range.
In one embodiment, obtaining the first time interval according to the first time value and the duration interval range, and the second time interval according to the second time value and the duration interval range, includes: generating the first time interval by taking the first time value as the midpoint of the first interval and the duration interval range as the length between that midpoint and each interval endpoint; and/or generating the second time interval by taking the second time value as the midpoint of the second interval and the duration interval range as the length between that midpoint and each interval endpoint.
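A minimal sketch of this interval construction, using hypothetical lag indices instead of time values: the valley sits at the interval midpoint, and the preset range is the distance from the midpoint to each endpoint.

```python
def interval_around(valley_index, half_width):
    """Build a search interval with the valley as its midpoint and the preset
    duration range as the distance from midpoint to each endpoint."""
    return (valley_index - half_width, valley_index + half_width)

first_interval = interval_around(80, 8)    # valley at lag 80, range +/-8
second_interval = interval_around(160, 8)  # second-smallest valley at lag 160
```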
In one embodiment, the preset number of time-domain signal frames before the current time-domain signal frame includes: four time domain signal frames located before the current time domain signal frame.
In one embodiment, after determining the fundamental frequency of each time domain signal frame according to the initial fundamental frequency, the method further includes: determining an initial pitch of each time domain signal frame based on its fundamental frequency; acquiring a raised pitch and a lowered pitch corresponding to the initial pitch; and performing harmony processing on each time domain signal frame using the raised pitch and the lowered pitch.
An apparatus for obtaining a fundamental frequency of an audio signal, the apparatus comprising:
the audio signal framing module is used for framing the time domain audio signal to obtain a plurality of time domain signal frames of the time domain audio signal;
the signal frame down-sampling module is used for performing low-pass filtering on each time domain signal frame and performing down-sampling on each time domain signal frame subjected to low-pass filtering to obtain a plurality of first audio data points contained in each time domain signal frame subjected to low-pass filtering;
the first autocorrelation module is used for taking an audio data point positioned in a preset audio data point selection interval in the plurality of first audio data points as a first target audio data point to obtain a plurality of first target audio data points, and determining an autocorrelation error of each first target audio data point to form an autocorrelation error curve;
a valley interval determining module, configured to determine a first time interval corresponding to the smallest valley of the autocorrelation error curve and a second time interval corresponding to the second-smallest valley;
a signal frame upsampling module, configured to perform upsampling on each time-domain signal frame after the low-pass filtering, to obtain a plurality of second audio data points included in each time-domain signal frame after the low-pass filtering;
a second autocorrelation module, configured to take audio data points located in the first time interval and the second time interval among the plurality of second audio data points as second target audio data points, obtain a plurality of second target audio data points, and determine an autocorrelation error of each second target audio data point;
an initial fundamental frequency determining module, configured to use an audio frequency corresponding to a second target audio data point with a smallest autocorrelation error as an initial fundamental frequency of each time-domain signal frame;
a signal frame fundamental frequency determining module, configured to determine the fundamental frequency of each time domain signal frame according to the initial fundamental frequency; the module is further configured to extract, from the initial fundamental frequencies, the initial fundamental frequencies corresponding to a preset number of time domain signal frames as target fundamental frequencies corresponding to those frames; determine a current time domain signal frame; if the current time domain signal frame is the first frame, take the initial fundamental frequency corresponding to the current time domain signal frame as its fundamental frequency; if the current time domain signal frame is not the first frame, acquire the initial fundamental frequencies corresponding to the preset number of time domain signal frames preceding the current time domain signal frame; and take the median of the initial fundamental frequency corresponding to the current time domain signal frame and the initial fundamental frequencies corresponding to the preset number of preceding time domain signal frames as the fundamental frequency of the current time domain signal frame.
An electronic device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
The method, the apparatus, the electronic device and the storage medium for obtaining the fundamental frequency of an audio signal frame the time domain audio signal to obtain a plurality of time domain signal frames of the time domain audio signal; perform low-pass filtering on each time domain signal frame, and downsample each low-pass-filtered time domain signal frame to obtain a plurality of first audio data points contained in each low-pass-filtered time domain signal frame; take the audio data points among the plurality of first audio data points that are located in a preset audio data point selection interval as first target audio data points, and determine the autocorrelation error of each first target audio data point to form an autocorrelation error curve; determine a first time interval corresponding to the smallest valley of the autocorrelation error curve and a second time interval corresponding to the second-smallest valley; upsample each low-pass-filtered time domain signal frame to obtain a plurality of second audio data points contained in each low-pass-filtered time domain signal frame; take the audio data points among the plurality of second audio data points that are located in the first time interval and the second time interval as second target audio data points, and determine the autocorrelation error of each second target audio data point; take the audio frequency corresponding to the second target audio data point with the smallest autocorrelation error as the initial fundamental frequency of each time domain signal frame; and determine the fundamental frequency of each time domain signal frame according to the initial fundamental frequency.
According to the method and the apparatus, the valleys of the autocorrelation error curve are first located using the first target audio data points in the audio data point selection interval obtained after downsampling, and the autocorrelation error is then computed by upsampling at the second target audio data points near those valleys, so that the range over which the fundamental frequency is searched is narrowed and the accuracy of the obtained fundamental frequency is improved.
Drawings
FIG. 1 is a flowchart illustrating a method for obtaining a fundamental frequency of an audio signal according to an embodiment;
FIG. 2 is a schematic flow chart illustrating the process of determining the autocorrelation error for each first target audio data point in one embodiment;
FIG. 3 is a schematic flow chart illustrating an embodiment of obtaining an autocorrelation error of each first target audio data point according to a first reference time interval and a second reference time interval;
FIG. 4 is a flow chart illustrating a process for determining a first time interval corresponding to a lowest valley and a second time interval corresponding to a next lowest valley of an autocorrelation error curve according to one embodiment;
FIG. 5 is a schematic diagram of a harmony processing system based on an improved fundamental frequency search according to an exemplary embodiment;
FIG. 6 is a diagram illustrating the positions of audio data points stored in a first buffer in an application example;
FIG. 7 is a graph of the squared error values of an audio signal in an application example;
FIG. 8 is a block diagram showing an exemplary embodiment of an apparatus for obtaining a fundamental frequency of an audio signal;
FIG. 9 is a diagram illustrating an internal structure of an electronic device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a method for obtaining a fundamental frequency of an audio signal is provided, which is exemplified by applying the method to a terminal, and in this embodiment, the method includes the following steps:
step S101, the terminal frames the time domain audio signal to obtain a plurality of time domain signal frames of the time domain audio signal.
The time domain audio signal is the audio signal to be processed; the terminal performs framing on it to obtain a plurality of time domain signal frames. Specifically, the terminal may split the time domain audio signal into frames of 5 ms each, thereby obtaining a plurality of time domain signal frames.
And S102, performing low-pass filtering on each time domain signal frame by the terminal, and performing down-sampling on each time domain signal frame subjected to low-pass filtering to obtain a plurality of first audio data points contained in each time domain signal frame subjected to low-pass filtering.
After the terminal obtains the time domain signal frames forming the time domain audio signal, it can low-pass filter each frame through a low-pass filter and then downsample each low-pass-filtered frame at a lower sampling rate, obtaining a plurality of audio sample points that serve as the first audio data points contained in each time domain signal frame.
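The patent does not specify a filter design or sampling rates, so the following sketch is purely illustrative: a crude moving-average FIR stands in for the low-pass filter, and a 48 kHz frame decimated by 4 is an assumed example, not a figure from the patent.

```python
def moving_average_lowpass(frame, taps=8):
    """Crude FIR low-pass: each output sample is the mean of the last `taps`
    input samples. A stand-in for the patent's unspecified low-pass filter."""
    out = []
    for i in range(len(frame)):
        window = frame[max(0, i - taps + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def downsample(frame, factor=4):
    """Keep every `factor`-th sample, lowering the effective sample rate."""
    return frame[::factor]

# e.g. a 5 ms frame at an assumed 48 kHz (240 samples), decimated 4x
frame = [float(n % 10) for n in range(240)]
first_points = downsample(moving_average_lowpass(frame))  # 60 first audio data points
```

A production implementation would use a proper anti-aliasing filter before decimation; the moving average merely illustrates the filter-then-downsample order.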
Step S103, the terminal uses an audio data point located in a preset audio data point selection interval among the plurality of first audio data points as a first target audio data point to obtain a plurality of first target audio data points, and determines an autocorrelation error of each first target audio data point to form an autocorrelation error curve.
The audio data point selection interval can be set according to the user requirement and corresponds to the current time domain signal frame. Specifically, the terminal may select, according to the set audio data point selection interval, an audio data point located in the audio data point selection interval corresponding to the current time-domain signal frame from the first audio data points, and use the selected audio data point as a first target audio data point corresponding to the current time-domain signal frame, thereby obtaining a plurality of first target audio data points. An autocorrelation error calculation may then be performed on each of the resulting first target audio data points to form an autocorrelation error curve for the current time-domain signal frame.
In step S104, the terminal determines a first time interval corresponding to the smallest valley of the autocorrelation error curve and a second time interval corresponding to the second-smallest valley.
After obtaining the autocorrelation error curve in step S103, the terminal can locate the smallest and second-smallest valleys of the curve and, based on them, obtain the first time interval and the second time interval of the current time domain signal frame, respectively.
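The valley search in step S104 can be sketched as follows. This is a simplified, hypothetical helper (not the patent's implementation): a valley is any point strictly lower than both neighbors, and the two with the smallest error values are returned.

```python
def two_smallest_valleys(curve):
    """Return the indices of the smallest and second-smallest local minima
    (valleys) of an autocorrelation-error curve."""
    valleys = [i for i in range(1, len(curve) - 1)
               if curve[i] < curve[i - 1] and curve[i] < curve[i + 1]]
    valleys.sort(key=lambda i: curve[i])
    return valleys[0], valleys[1]

curve = [5.0, 3.0, 4.0, 2.5, 0.5, 2.0, 1.0, 3.5]
lowest, second = two_smallest_valleys(curve)  # indices 4 and 6
```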
And step S105, the terminal performs upsampling on each time domain signal frame after the low-pass filtering to obtain a plurality of second audio data points contained in each time domain signal frame after the low-pass filtering.
The terminal then samples each low-pass-filtered time domain signal frame again, this time upsampling at a higher sampling rate to improve the time resolution, thereby obtaining the plurality of second audio data points contained in each time domain signal frame.
Step S106, the terminal uses the audio data points located in the first time interval and the second time interval from the plurality of second audio data points as second target audio data points to obtain a plurality of second target audio data points, and determines the autocorrelation error of each second target audio data point.
After the terminal obtains the second audio data points, the audio data points located in the first time interval and the second time interval corresponding to the current time domain signal frame can be selected from the second audio data points, and the audio data points are used as second target audio data points corresponding to the current time domain signal frame, so that a plurality of second target audio data points are obtained. And then, performing autocorrelation error calculation on each obtained second target audio data point again to obtain autocorrelation error of each second target audio data point corresponding to the current time-domain signal frame.
Step S107, the terminal takes the audio frequency corresponding to the second target audio data point with the minimum autocorrelation error as the initial fundamental frequency of each time domain signal frame;
and step S108, the terminal determines the fundamental frequency of each time domain signal frame according to the initial fundamental frequency.
After the terminal obtains the autocorrelation error of each second target audio data point corresponding to the current time domain signal frame, it can select the second target audio data point with the smallest autocorrelation error and take its corresponding audio frequency as the initial fundamental frequency of the current time domain signal frame. The terminal performs this process for every time domain signal frame of the time domain audio signal, obtaining the initial fundamental frequency of each frame. Finally, the terminal determines the fundamental frequency of the current time domain signal frame from the obtained initial fundamental frequencies, and repeats the process to obtain the fundamental frequency of each time domain signal frame.
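Steps S105–S107 amount to a coarse-to-fine refinement: recompute the error only at high-resolution lags inside the two valley intervals, then convert the winning lag to a frequency. A hedged sketch, with a made-up error table and an assumed 16 kHz upsampled rate:

```python
def refine_f0(fine_errors, intervals, fine_rate):
    """Among high-resolution candidate lags falling inside the valley
    intervals, pick the lag with the smallest autocorrelation error and
    convert it to a frequency (the initial fundamental frequency)."""
    candidates = [lag for lag in fine_errors
                  if any(lo <= lag <= hi for lo, hi in intervals)]
    best_lag = min(candidates, key=lambda lag: fine_errors[lag])
    return fine_rate / best_lag

# Synthetic error table with its minimum at lag 160 (period = 10 ms at 16 kHz)
errors = {lag: abs(lag - 160) + 0.1 for lag in range(120, 200)}
f0 = refine_f0(errors, intervals=[(150, 170), (310, 330)], fine_rate=16000)
```

Restricting `candidates` to the two intervals is what keeps the fine search cheap compared with scanning every upsampled lag.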
In the method for obtaining the fundamental frequency of an audio signal, the terminal frames the time domain audio signal to obtain a plurality of time domain signal frames of the time domain audio signal; performs low-pass filtering on each time domain signal frame, and downsamples each low-pass-filtered time domain signal frame to obtain a plurality of first audio data points contained in each low-pass-filtered time domain signal frame; takes the audio data points among the plurality of first audio data points that are located in a preset audio data point selection interval as first target audio data points, and determines the autocorrelation error of each first target audio data point to form an autocorrelation error curve; determines a first time interval corresponding to the smallest valley of the autocorrelation error curve and a second time interval corresponding to the second-smallest valley; upsamples each low-pass-filtered time domain signal frame to obtain a plurality of second audio data points contained in each low-pass-filtered time domain signal frame; takes the audio data points among the plurality of second audio data points that are located in the first time interval and the second time interval as second target audio data points, and determines the autocorrelation error of each second target audio data point; takes the audio frequency corresponding to the second target audio data point with the smallest autocorrelation error as the initial fundamental frequency of each time domain signal frame; and determines the fundamental frequency of each time domain signal frame according to the initial fundamental frequency.
In the present application, the terminal first locates the valleys of the autocorrelation error curve using the first target audio data points in the downsampled audio data point selection interval, and then computes the autocorrelation error at the upsampled second target audio data points near those valleys. This narrows the range over which the fundamental frequency is searched and improves the accuracy of the obtained fundamental frequency.
In one embodiment, to further improve the accuracy of the fundamental frequency of each time-domain signal frame, step S108 may further include: the terminal determines a current time domain signal frame; if the current time domain signal frame is the first frame, taking the initial fundamental frequency corresponding to the current time domain signal frame as the fundamental frequency of the current time domain signal frame; if the current time domain signal frame is a non-first frame, acquiring initial fundamental frequencies corresponding to a preset number of time domain signal frames before the current time domain signal frame; and taking the audio median of the initial fundamental frequencies corresponding to the current time domain signal frame and the initial fundamental frequencies corresponding to the preset number of time domain signal frames before the current time domain signal frame as the fundamental frequency of the current time domain signal frame.
The current time domain signal frame is the frame whose fundamental frequency currently needs to be determined; the preset number of time domain signal frames before it are the frames earlier in the time sequence, and the preset number can be set as required, for example the 4 frames preceding the current frame. Specifically, the terminal determines the current time domain signal frame. If it is the first frame, or if the number of time domain signal frames before it is less than the preset number, the terminal can directly take the obtained initial fundamental frequency of the current frame as its fundamental frequency. Otherwise, the terminal sorts the initial fundamental frequency of the current frame together with the initial fundamental frequencies of the preset number of preceding frames, and takes the median as the fundamental frequency of the current time domain signal frame.
For example, with a preset number of 4 frames, if the current time domain signal frame is not the first frame and its initial fundamental frequency is 99 Hz, the terminal also obtains the initial fundamental frequencies of the 4 preceding frames, say 102 Hz, 101 Hz, 100 Hz and 100 Hz. The terminal then sorts these 5 initial fundamental frequencies and outputs the median, i.e. 100 Hz, as the fundamental frequency of the current time domain signal frame.
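The median smoothing in this example can be sketched directly. The sketch uses a hypothetical `history_len` parameter and, as a simplification, keeps the initial value for every frame with too few predecessors (the patent states this explicitly only for the first frame):

```python
from statistics import median

def smooth_f0(initial_f0s, history_len=4):
    """Per-frame fundamental frequency: early frames keep their initial
    value; later frames take the median over the current frame and the
    previous `history_len` frames' initial fundamental frequencies."""
    smoothed = []
    for i, f0 in enumerate(initial_f0s):
        if i < history_len:
            smoothed.append(f0)  # too few preceding frames: use initial F0
        else:
            smoothed.append(median(initial_f0s[i - history_len: i + 1]))
    return smoothed

# The example above: previous four frames 102, 101, 100, 100 Hz, current 99 Hz
result = smooth_f0([102, 101, 100, 100, 99])  # last entry -> 100
```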
In this embodiment, the terminal obtains the final fundamental frequency of the current time domain signal frame by taking the median over a plurality of initial fundamental frequencies, which helps to further improve the accuracy of the obtained fundamental frequency of the audio signal.
In one embodiment, as shown in fig. 2, step S103 may include:
step S201, a terminal acquires the interval length of a preset first reference time interval;
step S202, based on the interval length, the terminal determines a second reference time interval corresponding to each first target audio data point.
The first reference time interval may also be set according to the actual needs of the user and corresponds to the current time domain signal frame; its interval length is the time length set by the user, for example 10 ms. The terminal may read the interval length of the first reference time interval and use it as the interval length of the second reference time interval. It may then obtain the second reference time interval corresponding to each first target audio data point, for example, by using the time value of the first target audio data point as one endpoint of the second reference time interval and generating the interval from the obtained interval length.
In step S203, the terminal obtains the autocorrelation error of each first target audio data point according to the first reference time interval and the second reference time interval.
After obtaining the first reference time interval and the second reference time interval, the terminal can calculate the autocorrelation error of the first target audio data point from the two intervals, and by repeating this process it obtains the autocorrelation error of each first target audio data point.
Further, as shown in fig. 3, step S203 may further include:
step S301, the terminal takes the audio data points located in the first reference time interval among the plurality of first audio data points as first reference frequency points, obtaining a plurality of first reference frequency points, and takes the frequencies of the plurality of first reference frequency points as first reference frequencies;
step S302, the terminal takes the audio data points located in the second reference time interval among the plurality of first audio data points as second reference frequency points, obtaining a plurality of second reference frequency points, and takes the frequencies of the plurality of second reference frequency points as second reference frequencies.
Specifically, after obtaining the first reference time interval and the second reference time interval in steps S201 and S202, the terminal may determine the first audio data points located in each interval, treat them as the first reference frequency points and second reference frequency points respectively, and obtain the frequency of each first reference frequency point as the first reference frequencies and the frequency of each second reference frequency point as the second reference frequencies.
In step S303, the terminal obtains error square values of each first reference frequency and each second reference frequency to obtain a plurality of error square values.
Since the sampling frequency is fixed during down-sampling and the interval lengths of the first and second reference time intervals are the same, the number of first reference frequency points equals the number of second reference frequency points, so each first reference frequency point necessarily has a corresponding second reference frequency point. The terminal can therefore compute, for each first reference frequency point, the error square value between its first reference frequency and the second reference frequency of the corresponding second reference frequency point, obtaining a plurality of error square values.
For example, suppose the first reference time interval contains frequency points A, B, and C, corresponding to frequencies A, B, and C respectively, and the second reference time interval contains frequency points D, E, and F, corresponding to frequencies D, E, and F respectively, where point A corresponds to point D, point B to point E, and point C to point F. The terminal can then compute the error square value of frequency A and frequency D, of frequency B and frequency E, and of frequency C and frequency F, obtaining a plurality of error square values.
Step S304, the terminal sums the error square values to obtain the autocorrelation error of each first target audio data point.
Specifically, the terminal may sum the square error values obtained in step S303 to obtain the autocorrelation error of each first target audio data point, and repeat the above process.
In this embodiment, the terminal determines the second reference time interval corresponding to each first target audio data point and computes its autocorrelation error against the first reference time interval. Because the autocorrelation error is the sum of squared errors between the audio data points in the second reference time interval and those in the first reference time interval, this helps to improve the accuracy of the obtained autocorrelation error, and thereby the accuracy of fundamental frequency acquisition.
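The error computation of steps S301 to S304 can be sketched as follows. This is a minimal illustration assuming the reference-point "frequencies" are represented by the sample values in each window, with hypothetical `ref_start`/`cand_start` window start indices:

```python
def autocorrelation_error(samples, ref_start, cand_start, length):
    """Sum of squared errors between two equal-length windows of `samples`:
    the reference window (first reference time interval) and a candidate
    window (second reference time interval)."""
    err = 0.0
    for i in range(length):
        diff = samples[ref_start + i] - samples[cand_start + i]
        err += diff * diff  # error square value for one point pair
    return err  # summed over all point pairs
```

A candidate window that exactly repeats the reference window yields an error of 0, which is why the troughs of this curve mark period candidates.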
In one embodiment, as shown in fig. 4, step S104 may further include:
in step S401, the terminal determines a first time value corresponding to the minimum trough value and a second time value corresponding to the second-minimum trough value.
Specifically, the terminal may first determine the minimum trough value and the second-minimum trough value of the autocorrelation error curve, and take their respective abscissas as the first time value corresponding to the minimum trough value and the second time value corresponding to the second-minimum trough value.
Step S402, the terminal acquires a preset duration interval range;
in step S403, the terminal obtains a first time interval according to the first time value and the time interval range, and obtains a second time interval according to the second time value and the time interval range.
The terminal may further obtain the first time interval and the second time interval respectively through the preset time interval range and the first time value and the second time value obtained in step S401.
Further, step S403 may further include: the terminal generates the first time interval by taking the first time value as the midpoint of the first interval and the duration interval range as the distance from that midpoint to each interval endpoint; likewise, the terminal generates the second time interval by taking the second time value as the midpoint of the second interval and the duration interval range as the distance from that midpoint to each interval endpoint.
Specifically, the terminal uses the first time value as the midpoint of the first time interval, generating the first time interval from this midpoint and the duration interval range, and uses the second time value as the midpoint of the second time interval, generating the second time interval in the same way.
For example, if the time value corresponding to the minimum trough value, that is, the first time value, is 0.03 s, and the set duration interval range is 0.02 ms, then the resulting first time interval runs from 0.03 s - 0.02 ms to 0.03 s + 0.02 ms.
In this embodiment, the terminal obtains the first and second time intervals from the set duration interval range and the first and second time values corresponding to the minimum and second-minimum trough values. Moreover, compared with using the first and second time values as interval endpoints, setting them as interval midpoints further improves the accuracy of the obtained time intervals, and thereby the accuracy of fundamental frequency acquisition.
In one embodiment, after step S108, the method may further include: the terminal determines the initial pitch of each time domain signal frame based on the fundamental frequency of each time domain signal frame; acquires the raised pitch and lowered pitch corresponding to the initial pitch; and performs harmony processing on each time domain signal frame using the raised pitch and the lowered pitch.
The initial pitch can be obtained by converting the fundamental frequency, and the raised pitch and lowered pitch are obtained by the terminal applying pitch-raising and pitch-lowering processing to the initial pitch, respectively. The terminal may determine the initial pitch corresponding to the obtained fundamental frequency of each time domain signal frame, obtain the corresponding raised and lowered pitches, and finally perform harmony processing on each time domain signal frame using the raised and lowered pitches.
In this embodiment, the terminal performs harmony processing with the obtained fundamental frequency of each time domain signal frame. Using a higher-precision fundamental frequency helps to reduce the difference between the harmony audio and the original sound, improving the harmony processing effect.
In an application example, the harmony processing system and method based on the improved fundamental frequency search, wherein the overall architecture of the system, as shown in fig. 5, may include:
The sound pickup unit acquires the audio signal. The audio signal is buffered, and when the buffered audio accumulates to one frame of data, the signal is input to the fundamental frequency searching unit and the harmony generating unit respectively;
The fundamental frequency searching unit calculates, through a fundamental frequency estimation algorithm, the corresponding frequency and the key value (that is, the scale value) of the mode closest to that frequency, extracts the original human voice, and outputs it to the reverberation unit.
The sound effect customizing unit is used for selecting the effect type of voice synthesis according to the requirement of a user;
The harmony processing unit generates the sound corresponding to the pitches in the chord table according to the key value obtained by the fundamental frequency searching unit and the chord table stored in the program, then mixes the main-melody sound (the input audio) with the harmony-effect sound and outputs the mix to the reverberation unit.
The electric sound processing unit calculates the ratio between the frequency value obtained by the fundamental frequency searching module and the standard key corresponding to the key value, readjusts the parameters of the electric sound unit's processor according to this ratio, and then outputs the result of the electric sound calculation to the reverberation unit;
a reverberation unit: the input original human voice, the harmony voice and the electric voice are subjected to reverberation processing and then output to the amplitude limiting unit.
The amplitude limiting unit compensates for the amplitude changes introduced by the harmony unit and the electric sound processing unit: it re-limits the data to the bit width of the digital-to-analog conversion unit and finally outputs it to the audio output unit to obtain the harmony audio.
Specifically, the harmony processing method based on the improved fundamental frequency search may include the steps of:
1. a time domain audio signal is acquired, here taking data with a sampling rate of 44.1kHz and a bit width of 16 bits as an example.
2. Data caching is performed, and when the data reaches the frame processing length of 5 ms, it is input into the fundamental frequency searching unit.
3. In the fundamental frequency search unit, the input audio is passed through a low-pass filter with a 4 kHz cutoff frequency, and the filtered data is then down-sampled. The down-sampling ratio is P = 44.1k / 4k = 11.025, rounded down to 11. The second graph of fig. 3 shows the 4 kHz down-sampled waveform. The output data is written into a buffer area, a first-in first-out storage unit (defined as the first buffer area), which may hold 50 ms of 4 kHz sample-rate monaural 16-bit audio. Before any audio data has been collected (the initial state in which the buffer area is idle), the buffer area is filled with an audio signal of amplitude 0 (a mute signal).
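A minimal sketch of this decimation step and the first buffer, assuming the low-pass filtering has already been applied; the names are illustrative:

```python
import math

FS_IN = 44100                      # input sampling rate
FS_WORK = 4000                     # working rate after down-sampling
P = math.floor(FS_IN / FS_WORK)    # ratio 11.025 rounded down -> 11

def decimate(filtered, ratio=P):
    """Keep every `ratio`-th sample of an already low-pass-filtered signal."""
    return filtered[::ratio]

# First buffer: 50 ms of 4 kHz mono audio, initially filled with silence.
first_buffer = [0] * int(0.050 * FS_WORK)   # 200 samples
```

Note that decimating by 11 gives an actual rate of 44100/11, about 4009 Hz, which the text treats as the 4 kHz working rate.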
As shown in fig. 6, the audio data that most recently entered the buffer area is marked as time 0, and the data from 0 ms to -10 ms is taken as the reference frequency points in_data. The interval from -30 ms to -12 ms in the buffer area is set as the audio data point selection interval, and each data point (sample point) within it serves as a starting point. For example, for the audio data point at -14 ms, with a window length of 10 ms, the data from -14 ms to -4 ms is recorded as s_data_3, and the error square array err_val_3 of s_data_3 against in_data is calculated. In this way the error square array err_val_n is calculated for every sampling point of the selection interval, and the error square values in each array are summed: Serr_val_n = sum((in_data - s_data_n)^2), where n denotes the selected position from -30 ms to -12 ms. This yields the error square-sum curve, and the positions and amplitudes of its troughs are recorded, as shown in fig. 7. The trough positions are the rectangular box positions in fig. 7.
In fig. 7, the error calculation result is shown, the vertical axis is the difference between the frequency point in the frequency point selection interval in fig. 6 and the reference frequency point in _ data, and the horizontal axis is the time axis and the unit is second(s).
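Putting the buffer layout of fig. 6 together, the error curve might be computed as below; the 4 kHz rate, 10 ms window, and -30 ms to -12 ms selection interval follow the text, while the function name is illustrative:

```python
def error_curve(buf, fs=4000):
    """Summed-squared-error curve over the audio data point selection
    interval. `buf` is the first buffer; its newest sample sits at time 0."""
    win = int(0.010 * fs)                    # 10 ms reference window
    in_data = buf[-win:]                     # reference points: last 10 ms
    start = len(buf) - int(0.030 * fs)       # selection interval: -30 ms ...
    stop = len(buf) - int(0.012 * fs)        # ... to -12 ms
    curve = []
    for n in range(start, stop + 1):
        s_data = buf[n:n + win]              # candidate window s_data_n
        curve.append(sum((a - b) ** 2 for a, b in zip(in_data, s_data)))
    return curve                             # troughs mark period candidates
```

For a periodic signal, the curve dips to zero wherever the candidate window lies a whole number of periods before the reference window.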
The trough values are recorded into an array, and the minimum value and the second-minimum value in the array are found. The signal is then upsampled to 88 kHz (a higher sampling frequency improves the time resolution), and the minimum (corresponding to the fundamental frequency) is searched within a preset time range (0.02 ms) around the minimum value and, separately, around the second-minimum value.
For example, suppose the minimum troughs found at the low sampling rate lie at 0.0033 s and 0.004 s. The signal is then upsampled to 88 kHz, and after upsampling the autocorrelation error is recalculated in the interval from 0.0033 s - 0.02 ms to 0.0033 s + 0.02 ms to search for the minimum; the same search is performed in the interval from 0.004 s - 0.02 ms to 0.004 s + 0.02 ms. The position of the finally obtained minimum is recorded as time_min, and the fundamental frequency is calculated as 1/time_min.
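A sketch of this fine-search step; the `error_at` callable is a hypothetical stand-in for recomputing the autocorrelation error at a candidate period on the upsampled signal:

```python
def refine_f0(coarse_times, error_at, half_window=0.02e-3, fs_up=88000):
    """Search a +/- half_window range around each coarse trough time on the
    88 kHz upsampled grid, and return f0 = 1 / time_min."""
    step = 1.0 / fs_up
    best_t, best_err = None, float("inf")
    for t0 in coarse_times:                  # e.g. [0.0033, 0.004]
        t = t0 - half_window
        while t <= t0 + half_window:
            err = error_at(t)
            if err < best_err:
                best_err, best_t = err, t
            t += step
    return 1.0 / best_t                      # fundamental frequency
```

The best candidate can only be as close to the true period as the 88 kHz grid allows, which is why the text emphasizes that the higher sampling frequency improves time resolution.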
Finally, the fundamental frequency, here denoted f0, is obtained and written into a first-in first-out buffer (the second buffer) of length 5. The array in the buffer area is sorted and the median is output; this median is the searched target fundamental frequency.
The key value is then obtained by frequency conversion under twelve-tone equal temperament; the key value of the current fundamental frequency is calculated by the following formula:
key_index=round(log(f0/65.41)/log(2)*12)
Here 65.41 Hz is taken as the first key, with subsequent keys rising according to twelve-tone equal temperament, and round denotes rounding to the nearest integer.
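The key-value formula above, sketched in Python:

```python
import math

def key_index(f0):
    """Map a fundamental frequency to the nearest twelve-tone equal
    temperament key, with 65.41 Hz as key 0."""
    return round(math.log(f0 / 65.41) / math.log(2) * 12)
```

Each doubling of f0 adds 12 keys (one octave), so `key_index(130.82)` is 12.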
4. Harmony effect generation unit: the mode needs to be configured first, the default is C major, and the mode can be switched to C minor or G major.
The unit contains 2 parts.
A first part: this part comprises a counting module, a random number generating module, and a storage module, where the storage module stores a table of common chords. The chord corresponding to the key calculated by the fundamental frequency module is found. In this embodiment, the key is raised in pitch by 3 degrees and lowered in pitch by 3 degrees, and it is confirmed whether the two resulting pitches are in the pre-stored chord composition table.
1) When the two shifted pitches are in the pre-stored chord composition table: after the key is raised by 3 degrees, the number of semitones K0 between the original and raised pitches is calculated, and the frequency ratio obtained through the frequency ratio conversion formula is delta0; after the key is lowered by 3 degrees, the number of semitones K1 between the original and lowered pitches is calculated, and the frequency ratio obtained through the formula is delta1.
2) When the two shifted pitches are not in the pre-stored chord composition table: the chord table is consulted to find the raised pitch and lowered pitch closest to the key, which are then used as the first harmony pitch and the second harmony pitch respectively; the semitone differences K0 and K1 between these pitches and the initial key are calculated, and the frequency ratios delta0 and delta1 are obtained from the frequency ratio conversion formula;
frequency ratio conversion formula: deltaN = 2^(k/12);
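This conversion, deltaN = 2^(k/12), sketched in Python:

```python
def frequency_ratio(k):
    """Frequency ratio for a pitch shift of k semitones in twelve-tone
    equal temperament: deltaN = 2 ** (k / 12)."""
    return 2.0 ** (k / 12)
```

A 12-semitone shift doubles the frequency; shifting down uses a negative k, giving a ratio below 1.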
A second part: acquires the audio signal frame and the up-modulation frequency ratio delta0 and down-modulation frequency ratio delta1 of the current key, and uses them to generate the raised harmony sound and the lowered harmony sound; the spectra of the raised and lowered harmony sounds are superposed to obtain the target harmony, which is then output.
5. Electric sound effect generation unit: the mode also needs to be configured first, the default is C major, and the mode can be switched to C minor or G major.
6. Bypass module (power-off control module): when Bypass is active or power is lost, the data continues to pass through the harmony module, but no harmony processing is performed.
7. Reverberation: conventional reverberation processing is performed on input data.
8. An amplitude limiting processing output unit:
the data block size is calculated with 50ms as an energy, 10ms as a fifo buffer unit, and then the volume is adjusted. The volume mapping curve is shown in fig. 6, where the horizontal axis is the input amplitude in dB, and the vertical axis is the output amplitude in dB; after the frequency spectrum is processed to 16 bits through amplitude limiting, the data is output to a digital-to-analog conversion module.
The fundamental frequency searching method provided in this application example obtains audio data by down-sampling with a 4 kHz cutoff frequency, selects frequency points in a preset interval, and computes their relative error in an autocorrelation manner to obtain an error amplitude spectrum. The trough positions and amplitudes of the error amplitude spectrum are recorded to obtain a trough array, in which the minimum value and second-minimum value are found. The signal is then upsampled to 88 kHz (a higher sampling frequency improves the time resolution), and the fundamental frequency (the position of the minimum) is searched within a preset time range (0.02 ms) around the minimum and the second-minimum respectively. The finally found fundamental frequency f0 is written into a first-in first-out buffer of length 5; the array in the buffer is sorted and the median is output as the target fundamental frequency. By searching through the trough values, finely searching the upsampled data, and applying median processing, this multi-stage fundamental frequency search narrows the search range and improves the precision of fundamental frequency estimation.
It should be understood that, although the steps in the flowcharts of the present application are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not restricted to the exact order shown and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a fundamental frequency acquisition apparatus of an audio signal, including: an audio signal framing module 801, a signal frame down-sampling module 802, a first autocorrelation module 803, a valley interval determination module 804, a signal frame up-sampling module 805, a second autocorrelation module 806, an initial fundamental frequency determination module 807, and a signal frame fundamental frequency determination module 808, wherein:
an audio signal framing module 801, configured to frame a time domain audio signal to obtain multiple time domain signal frames of the time domain audio signal;
a signal frame down-sampling module 802, configured to perform low-pass filtering on each time domain signal frame, and perform down-sampling on each time domain signal frame after the low-pass filtering to obtain a plurality of first audio data points included in each time domain signal frame after the low-pass filtering;
a first autocorrelation module 803, configured to take an audio data point located in a preset audio data point selection interval among the multiple first audio data points as a first target audio data point, obtain multiple first target audio data points, and determine an autocorrelation error of each first target audio data point, so as to form an autocorrelation error curve;
a trough interval determining module 804, configured to determine a first time interval corresponding to a minimum trough value of the autocorrelation error curve and a second time interval corresponding to a second minimum trough value;
a signal frame upsampling module 805, configured to perform upsampling on each time-domain signal frame after the low-pass filtering to obtain a plurality of second audio data points included in each time-domain signal frame after the low-pass filtering;
a second autocorrelation module 806, configured to take audio data points located in the first time interval and the second time interval in the plurality of second audio data points as second target audio data points, obtain a plurality of second target audio data points, and determine an autocorrelation error of each second target audio data point;
an initial fundamental frequency determining module 807, configured to use an audio frequency corresponding to the second target audio data point with the smallest autocorrelation error as an initial fundamental frequency of each time-domain signal frame;
and a signal frame fundamental frequency determining module 808, configured to determine a fundamental frequency of each time-domain signal frame according to the initial fundamental frequency.
In an embodiment, the first autocorrelation module 803 is further configured to obtain an interval length of a preset first reference time interval; determining a second reference time interval corresponding to each first target audio data point based on the interval length; and obtaining the autocorrelation error of each first target audio data point according to the first reference time interval and the second reference time interval.
In an embodiment, the first autocorrelation module 803 is further configured to use an audio data point located in a first reference time interval in the plurality of first audio data points as a first reference frequency point, to obtain a plurality of first reference frequency points, and use the frequencies of the plurality of first reference frequency points as first reference frequencies; taking the audio data points in the second reference time interval in the plurality of first audio data points as second reference frequency points to obtain a plurality of second reference frequency points, and taking the frequencies of the plurality of second reference frequency points as second reference frequencies; acquiring error square values of the first reference frequencies and the second reference frequencies to obtain a plurality of error square values; and summing the error square values to obtain the autocorrelation error of each first target audio data point.
In one embodiment, the valley interval determining module 804 is further configured to determine a first time value corresponding to the lowest-minimum valley value and a second time value corresponding to the next lowest-minimum valley value; acquiring a preset duration interval range; and obtaining a first time interval according to the first time value and the range of the time interval, and obtaining a second time interval according to the second time value and the range of the time interval.
In one embodiment, the trough interval determining module 804 is further configured to generate a first time interval by using the first time value as a first interval midpoint and using the duration interval range as an interval length between the first interval midpoint and an interval endpoint; and the time interval generating unit is used for generating a second time interval by taking the second time value as a second interval midpoint and taking the time interval range as an interval length between the second interval midpoint and an interval endpoint.
In one embodiment, the signal frame fundamental frequency determination module 808 is further configured to determine the current time domain signal frame; if the current time domain signal frame is the first frame, use the initial fundamental frequency corresponding to the current time domain signal frame as its fundamental frequency; if the current time domain signal frame is not the first frame, acquire the initial fundamental frequencies corresponding to the preset number of time domain signal frames before it; and use the median of the initial fundamental frequency corresponding to the current time domain signal frame and the initial fundamental frequencies corresponding to the preset number of preceding time domain signal frames as the fundamental frequency of the current time domain signal frame.
In one embodiment, the apparatus for obtaining a fundamental frequency of an audio signal further includes: the acoustic processing module is used for determining the initial pitch of each time domain signal frame based on the fundamental frequency of each time domain signal frame; acquiring an up-pitch and a down-pitch corresponding to the initial pitch; and performing harmony processing on each time domain signal frame by using the pitch rising pitch and the pitch falling pitch.
For the specific definition of the fundamental frequency obtaining device of the audio signal, reference may be made to the above definition of the fundamental frequency obtaining method of the audio signal, and details are not described here. The above-mentioned modules in the fundamental frequency obtaining device of the audio signal can be wholly or partially implemented by software, hardware and their combination. The modules can be embedded in a hardware form or independent of a processor in the electronic device, or can be stored in a memory in the electronic device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, an electronic device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 9. The electronic device comprises a processor, a memory, a communication interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of obtaining a fundamental frequency of an audio signal. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the configuration shown in fig. 9 is a block diagram of only a portion of the configuration relevant to the present application, and does not constitute a limitation on the electronic device to which the present application is applied, and a particular electronic device may include more or less components than those shown in the drawings, or combine certain components, or have a different arrangement of components.
In one embodiment, an electronic device is further provided, comprising a memory and a processor, the memory storing a computer program; the processor, when executing the computer program, implements the steps of the above method embodiments.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the above method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features is not contradictory, it should be considered within the scope of this specification.
The above-mentioned embodiments merely express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for obtaining a fundamental frequency of an audio signal, the method comprising:
framing a time domain audio signal to obtain a plurality of time domain signal frames of the time domain audio signal;
performing low-pass filtering on each time domain signal frame, and performing down-sampling on each time domain signal frame subjected to low-pass filtering to obtain a plurality of first audio data points contained in each time domain signal frame subjected to low-pass filtering;
taking, among the plurality of first audio data points, the audio data points located in a preset audio data point selection interval as first target audio data points to obtain a plurality of first target audio data points, and determining the autocorrelation error of each first target audio data point to form an autocorrelation error curve;
determining a first time interval corresponding to the lowest valley value of the autocorrelation error curve and a second time interval corresponding to the second lowest valley value;
performing upsampling on each time domain signal frame subjected to low-pass filtering to obtain a plurality of second audio data points contained in each time domain signal frame subjected to low-pass filtering;
taking, among the plurality of second audio data points, the audio data points located in the first time interval and the second time interval as second target audio data points to obtain a plurality of second target audio data points, and determining the autocorrelation error of each second target audio data point;
taking the audio frequency corresponding to the second target audio data point with the minimum autocorrelation error as the initial fundamental frequency of each time domain signal frame;
determining the fundamental frequency of each time domain signal frame according to the initial fundamental frequency, which comprises: extracting initial fundamental frequencies corresponding to a preset number of time domain signal frames from the initial fundamental frequencies to serve as target fundamental frequencies corresponding to the time domain signal frames; determining a current time domain signal frame; if the current time domain signal frame is the first frame, taking the initial fundamental frequency corresponding to the current time domain signal frame as the fundamental frequency of the current time domain signal frame; if the current time domain signal frame is not the first frame, acquiring the initial fundamental frequencies corresponding to the preset number of time domain signal frames before the current time domain signal frame, and taking the median of the initial fundamental frequency corresponding to the current time domain signal frame and the initial fundamental frequencies corresponding to the preset number of time domain signal frames before the current time domain signal frame as the fundamental frequency of the current time domain signal frame.
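Outside the claim language, the median-smoothing step that closes claim 1 can be sketched in Python as follows. The function name, the list-based interface, and the handling of early frames that have fewer than the preset number of predecessors are illustrative assumptions, not the claimed implementation; `history=4` mirrors the four preceding frames named in claim 6.

```python
# Illustrative sketch (not the claimed implementation): the first frame
# keeps its initial fundamental-frequency estimate; every later frame
# takes the median over itself and up to `history` preceding frames,
# which suppresses isolated outlier estimates such as octave errors.
from statistics import median

def smooth_fundamentals(initial_f0s, history=4):
    smoothed = []
    for i, f0 in enumerate(initial_f0s):
        if i == 0:
            smoothed.append(f0)  # first frame: keep the initial estimate
        else:
            # current frame plus up to `history` preceding frames
            window = initial_f0s[max(0, i - history):i + 1]
            smoothed.append(median(window))
    return smoothed
```

For example, a spurious 500 Hz estimate among frames near 100 Hz is replaced by a value drawn from its neighbors, while stable frames pass through almost unchanged.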
2. The method of claim 1, wherein determining the autocorrelation error for each first target audio data point comprises:
acquiring the interval length of a preset first reference time interval;
determining a second reference time interval corresponding to each first target audio data point based on the interval length;
and obtaining the autocorrelation error of each first target audio data point according to the first reference time interval and the second reference time interval.
3. The method of claim 2, wherein obtaining the autocorrelation error of each first target audio data point according to the first reference time interval and the second reference time interval comprises:
taking, among the plurality of first audio data points, the audio data points located in the first reference time interval as first reference frequency points to obtain a plurality of first reference frequency points, and taking the frequencies of the plurality of first reference frequency points as first reference frequencies;
taking, among the plurality of first audio data points, the audio data points located in the second reference time interval as second reference frequency points to obtain a plurality of second reference frequency points, and taking the frequencies of the plurality of second reference frequency points as second reference frequencies;
calculating squared error values between the first reference frequencies and the corresponding second reference frequencies to obtain a plurality of squared error values;
and summing the squared error values to obtain the autocorrelation error of each first target audio data point.
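As an illustrative sketch only (not the claimed implementation), the squared-error measure of claims 2 and 3 can be expressed as a sum of squared differences between a fixed reference segment and a segment shifted by a candidate lag; for a periodic signal this error dips toward zero when the lag matches the period. All function and parameter names below are hypothetical.

```python
import math

def autocorr_error(samples, ref_start, ref_len, lag):
    """Sum of squared differences between a fixed reference segment
    and the same-length segment shifted by `lag` samples."""
    total = 0.0
    for k in range(ref_len):
        d = samples[ref_start + k] - samples[ref_start + lag + k]
        total += d * d
    return total

def error_curve(samples, ref_start, ref_len, lags):
    """Autocorrelation-error value for each candidate lag."""
    return [autocorr_error(samples, ref_start, ref_len, lag) for lag in lags]

# A perfectly periodic test tone with a 50-sample period: the error is
# near zero at lag 50 and large at lag 25 (half a period).
tone = [math.sin(2 * math.pi * t / 50) for t in range(300)]
```

Minimizing this error over candidate lags is what makes the curve's valleys correspond to pitch-period candidates.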
4. The method of claim 1, wherein determining a first time interval corresponding to the lowest valley value of the autocorrelation error curve and a second time interval corresponding to the second lowest valley value comprises:
determining a first time value corresponding to the lowest valley value and a second time value corresponding to the second lowest valley value;
acquiring a preset duration interval range;
and obtaining the first time interval according to the first time value and the duration interval range, and obtaining the second time interval according to the second time value and the duration interval range.
5. The method of claim 4, wherein obtaining the first time interval according to the first time value and the range of duration intervals, and obtaining the second time interval according to the second time value and the range of duration intervals comprises:
generating the first time interval by taking the first time value as a first interval midpoint and taking the duration interval range as the interval length between the first interval midpoint and an interval endpoint; and/or
generating the second time interval by taking the second time value as a second interval midpoint and taking the duration interval range as the interval length between the second interval midpoint and an interval endpoint.
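The valley search of claims 4 and 5 can be illustrated as follows; the simple local-minimum test and the tuple representation of an interval are assumptions made for illustration, not claim language.

```python
def two_lowest_valleys(curve):
    """Indices of the two lowest local minima of an error curve
    (lowest first), an illustrative stand-in for the lowest and
    second lowest valley values of claims 4-5."""
    valleys = [i for i in range(1, len(curve) - 1)
               if curve[i] < curve[i - 1] and curve[i] < curve[i + 1]]
    valleys.sort(key=lambda i: curve[i])  # order by valley depth
    return valleys[:2]

def centered_interval(time_value, half_width):
    """Interval with the valley's time value as midpoint and
    `half_width` as the midpoint-to-endpoint length (claim 5)."""
    return (time_value - half_width, time_value + half_width)
```

The two returned intervals then bound the finer lag search performed on the upsampled signal in claim 1.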
6. The method according to any one of claims 1 to 5, wherein the preset number of time-domain signal frames before the current time-domain signal frame comprises: four time domain signal frames located before the current time domain signal frame.
7. The method of claim 1, wherein after determining the fundamental frequency of each time-domain signal frame according to the initial fundamental frequency, the method further comprises:
determining an initial pitch of each time domain signal frame based on the fundamental frequency of each time domain signal frame;
acquiring a raised pitch and a lowered pitch corresponding to the initial pitch;
and performing harmony processing on each time domain signal frame by using the raised pitch and the lowered pitch.
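Claim 7 does not specify how the raised and lowered pitches are derived from the initial pitch. The sketch below assumes, purely for illustration, an equal-tempered shift of a whole number of semitones around the frame's fundamental frequency; the function name and the semitone parameter are hypothetical.

```python
def harmony_pitches(f0, semitones=1):
    """Return a (raised, lowered) pitch pair around fundamental f0,
    assuming an equal-tempered shift: one semitone = a factor of
    2**(1/12). This interval choice is an illustrative assumption."""
    ratio = 2 ** (semitones / 12)
    return f0 * ratio, f0 / ratio
```

With a 12-semitone (one-octave) shift, a 440 Hz fundamental yields 880 Hz and 220 Hz harmony voices.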
8. An apparatus for obtaining a fundamental frequency of an audio signal, the apparatus comprising:
the audio signal framing module is used for framing the time domain audio signal to obtain a plurality of time domain signal frames of the time domain audio signal;
the signal frame down-sampling module is used for performing low-pass filtering on each time domain signal frame and performing down-sampling on each time domain signal frame subjected to low-pass filtering to obtain a plurality of first audio data points contained in each time domain signal frame subjected to low-pass filtering;
the first autocorrelation module is used for taking an audio data point positioned in a preset audio data point selection interval in the plurality of first audio data points as a first target audio data point to obtain a plurality of first target audio data points, and determining an autocorrelation error of each first target audio data point to form an autocorrelation error curve;
a valley interval determining module, configured to determine a first time interval corresponding to the lowest valley value of the autocorrelation error curve and a second time interval corresponding to the second lowest valley value;
a signal frame upsampling module, configured to perform upsampling on each time-domain signal frame after the low-pass filtering, to obtain a plurality of second audio data points included in each time-domain signal frame after the low-pass filtering;
a second autocorrelation module, configured to take audio data points located in the first time interval and the second time interval among the plurality of second audio data points as second target audio data points, obtain a plurality of second target audio data points, and determine an autocorrelation error of each second target audio data point;
an initial fundamental frequency determining module, configured to use the audio frequency corresponding to the second target audio data point with the smallest autocorrelation error as the initial fundamental frequency of each time-domain signal frame;
a signal frame fundamental frequency determining module, configured to determine a fundamental frequency of each time-domain signal frame according to the initial fundamental frequency; the signal frame fundamental frequency determining module is further configured to extract initial fundamental frequencies corresponding to a preset number of time-domain signal frames from the initial fundamental frequencies as target fundamental frequencies corresponding to the time-domain signal frames; determine a current time-domain signal frame; if the current time-domain signal frame is the first frame, take the initial fundamental frequency corresponding to the current time-domain signal frame as the fundamental frequency of the current time-domain signal frame; if the current time-domain signal frame is not the first frame, acquire the initial fundamental frequencies corresponding to the preset number of time-domain signal frames before the current time-domain signal frame, and take the median of the initial fundamental frequency corresponding to the current time-domain signal frame and the initial fundamental frequencies corresponding to the preset number of time-domain signal frames before the current time-domain signal frame as the fundamental frequency of the current time-domain signal frame.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202010829745.4A 2020-08-18 2020-08-18 Method and device for obtaining fundamental frequency of audio signal, electronic equipment and storage medium Active CN112086104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010829745.4A CN112086104B (en) 2020-08-18 2020-08-18 Method and device for obtaining fundamental frequency of audio signal, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112086104A CN112086104A (en) 2020-12-15
CN112086104B true CN112086104B (en) 2022-04-29

Family

ID=73728362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010829745.4A Active CN112086104B (en) 2020-08-18 2020-08-18 Method and device for obtaining fundamental frequency of audio signal, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112086104B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885367B (en) * 2021-01-19 2022-04-08 珠海市杰理科技股份有限公司 Fundamental frequency acquisition method, fundamental frequency acquisition device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107833581A (en) * 2017-10-20 2018-03-23 广州酷狗计算机科技有限公司 A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound
CN108600936A (en) * 2018-04-19 2018-09-28 北京微播视界科技有限公司 Multichannel audio processing method, device, computer readable storage medium and terminal
CN109147809A (en) * 2018-09-20 2019-01-04 广州酷狗计算机科技有限公司 Acoustic signal processing method, device, terminal and storage medium
CN109346109A (en) * 2018-12-05 2019-02-15 百度在线网络技术(北京)有限公司 Fundamental frequency extracting method and device
CN110706693A (en) * 2019-10-18 2020-01-17 浙江大华技术股份有限公司 Method and device for determining voice endpoint, storage medium and electronic device
CN110931035A (en) * 2019-12-09 2020-03-27 广州酷狗计算机科技有限公司 Audio processing method, device, equipment and storage medium
CN111223491A (en) * 2020-01-22 2020-06-02 深圳市倍轻松科技股份有限公司 Method, device and terminal equipment for extracting music signal main melody

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014219607A (en) * 2013-05-09 2014-11-20 ソニー株式会社 Music signal processing apparatus and method, and program

Also Published As

Publication number Publication date
CN112086104A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
JP5275612B2 (en) Periodic signal processing method, periodic signal conversion method, periodic signal processing apparatus, and periodic signal analysis method
Serra et al. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition
US20100192753A1 (en) Karaoke apparatus
CA1065490A (en) Emphasis controlled speech synthesizer
JPH0618351B2 (en) Music signal communication equipment
CN104620313A (en) Audio signal analysis
US20230402026A1 (en) Audio processing method and apparatus, and device and medium
WO2022089097A1 (en) Audio processing method and apparatus, electronic device, and computer-readable storage medium
CN112086104B (en) Method and device for obtaining fundamental frequency of audio signal, electronic equipment and storage medium
CN112309409A (en) Audio correction method and related device
CN113259832A (en) Microphone array detection method and device, electronic equipment and storage medium
Keefe et al. Correlation dimension of woodwind multiphonic tones
US6453253B1 (en) Impulse response measuring method
CN112151055A (en) Audio processing method and device
CN112086085B (en) Audio signal sound processing method, device, electronic equipment and storage medium
CN111489739A (en) Phoneme recognition method and device and computer readable storage medium
US6965069B2 (en) Programmable melody generator
WO2019229738A1 (en) System for decomposition of digital sound samples into sound objects
JP4483561B2 (en) Acoustic signal analysis apparatus, acoustic signal analysis method, and acoustic signal analysis program
Dubnov Polyspectral analysis of musical timbre
JP2000293188A (en) Chord real time recognizing method and storage medium
CN107025902A (en) Data processing method and device
JP3741106B2 (en) Musical sound waveform analysis method and musical sound waveform analysis synthesis method
CN112164387A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium
JP5552794B2 (en) Method and apparatus for encoding acoustic signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 519000 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province

Applicant after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province

Applicant before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

GR01 Patent grant