CN109346109B - Fundamental frequency extraction method and device - Google Patents

Fundamental frequency extraction method and device

Info

Publication number
CN109346109B
Authority
CN
China
Prior art keywords
voice
processed
unvoiced
candidate
base frequency
Prior art date
Legal status
Active
Application number
CN201811482074.8A
Other languages
Chinese (zh)
Other versions
CN109346109A (en)
Inventor
李骁
盖于涛
陈昌滨
孙晨曦
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811482074.8A priority Critical patent/CN109346109B/en
Publication of CN109346109A publication Critical patent/CN109346109A/en
Application granted granted Critical
Publication of CN109346109B publication Critical patent/CN109346109B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/06 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being correlation coefficients
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G10L2025/937 Signal energy in various frequency bands

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Auxiliary Devices For Music (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the present application disclose a fundamental frequency extraction method and a fundamental frequency extraction device. One embodiment of the method comprises: extracting candidate base frequency points of each speech frame in a speech signal to be processed based on acoustic features of the speech signal to be processed; performing unvoiced/voiced classification on the speech frames to obtain the unvoiced/voiced category corresponding to each speech frame; and correcting the candidate base frequency points based on the unvoiced/voiced category corresponding to each speech frame and a preset fundamental frequency screening condition, and determining a fundamental frequency sequence of the speech signal to be processed from the corrected candidate base frequency points by means of a dynamic programming algorithm. This embodiment improves the accuracy of fundamental frequency extraction.

Description

Fundamental frequency extraction method and device
Technical Field
The embodiments of the present application relate to the field of computer technology, in particular to the field of speech synthesis, and specifically to a fundamental frequency extraction method and device.
Background
Speech synthesis is a technique for generating synthesized speech by mechanical or electronic means. In speech synthesis, a text is segmented into words, the pronunciation of the text is determined, the acoustic features of the speech signal are predicted, and the speech signal is synthesized according to the predicted acoustic features.
The fundamental frequency is the reciprocal of the pitch period, which is the duration of one opening-and-closing cycle of the vocal cords. The fundamental frequency is an important acoustic feature in speech synthesis, and whether it is extracted accurately directly affects the accuracy of acoustic modeling in speech synthesis.
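As a brief illustrative example of this relationship (the numbers are not taken from the present application): if the vocal cords complete one opening-and-closing cycle every 5 ms, the pitch period is 0.005 s and the fundamental frequency is 1/0.005 s = 200 Hz, a value typical of adult speech.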
Disclosure of Invention
The embodiment of the application provides a fundamental frequency extraction method and a fundamental frequency extraction device.
In a first aspect, an embodiment of the present application provides a fundamental frequency extraction method, including: extracting candidate base frequency points of each voice frame in the voice signal to be processed based on the acoustic characteristics of the voice signal to be processed; carrying out unvoiced and voiced classification on the voice frames to obtain unvoiced and voiced categories corresponding to the voice frames; and correcting the candidate base frequency points based on the unvoiced and voiced sound classes corresponding to the voice frames and preset base frequency screening conditions, and determining a base frequency sequence of the voice signal to be processed from the corrected candidate base frequency points by adopting a dynamic programming algorithm.
In some embodiments, the performing unvoiced/voiced classification on the speech frame to obtain an unvoiced/voiced classification corresponding to each speech frame includes: and inputting the extracted acoustic features of the voice signal to be processed into the trained unvoiced and voiced sound classification model to obtain unvoiced and voiced sound classification results corresponding to each voice frame in the voice signal to be processed.
In some embodiments, the above method further comprises: and training to obtain a trained unvoiced and voiced sound classification model based on the sample voice signals marked with the unvoiced and voiced sound classification information of each voice frame.
In some embodiments, the extracting the candidate base frequency points of each speech frame in the speech signal to be processed based on the acoustic features of the speech signal to be processed includes: down-sampling the voice signal to be processed; calculating a peak point of a cross-correlation function based on acoustic characteristics for a voice frame in the voice signal to be processed after down-sampling, and determining a candidate base frequency point corresponding to the voice signal to be processed after down-sampling according to the peak point; and mapping the candidate base frequency point corresponding to the voice signal to be processed after down-sampling to the voice signal to be processed to obtain the candidate base frequency point of each voice frame in the voice signal to be processed.
In some embodiments, the modifying the candidate base frequency point based on the unvoiced/voiced sound class corresponding to each speech frame and a preset base frequency screening condition includes: determining a fundamental frequency candidate interval according to the distribution characteristics of the candidate fundamental frequency points of each voice frame; correcting the fundamental frequency candidate interval based on the voiced and unvoiced class of each voice frame to obtain a corrected fundamental frequency candidate interval; and replacing the target candidate base frequency point which is not in the corrected base frequency candidate interval with other candidate base frequency points in the voice frame corresponding to the target candidate base frequency point to obtain the corrected candidate base frequency point.
In a second aspect, an embodiment of the present application provides a fundamental frequency extracting apparatus, including: the extraction unit is configured to extract candidate base frequency points of each voice frame in the voice signal to be processed based on the acoustic characteristics of the voice signal to be processed; the classification unit is configured to classify the voiced and unvoiced speech frames to obtain voiced and unvoiced speech categories corresponding to the speech frames; and the determining unit is configured to modify the candidate base frequency points based on the unvoiced/voiced sound class corresponding to each voice frame and a preset base frequency screening condition, and determine a base frequency sequence of the voice signal to be processed from the modified candidate base frequency points by adopting a dynamic programming algorithm.
In some embodiments, the classifying unit is further configured to perform unvoiced-voiced classification on the speech frames to obtain an unvoiced-voiced class corresponding to each speech frame as follows: and inputting the extracted acoustic features of the voice signal to be processed into the trained unvoiced and voiced sound classification model to obtain unvoiced and voiced sound classification results corresponding to each voice frame in the voice signal to be processed.
In some embodiments, the above apparatus further comprises: and the training unit is configured to train and obtain a trained unvoiced and voiced sound classification model based on the sample voice signal labeled with the unvoiced and voiced sound classification information of each voice frame.
In some embodiments, the extracting unit is further configured to extract the candidate base frequency point of each speech frame in the speech signal to be processed according to the following manner based on the acoustic features of the speech signal to be processed: down-sampling the voice signal to be processed; calculating a peak point of a cross-correlation function based on acoustic characteristics for a voice frame in the voice signal to be processed after down-sampling, and determining a candidate base frequency point corresponding to the voice signal to be processed after down-sampling according to the peak point; and mapping the candidate base frequency point corresponding to the voice signal to be processed after down-sampling to the voice signal to be processed to obtain the candidate base frequency point of each voice frame in the voice signal to be processed.
In some embodiments, the determining unit is further configured to modify the candidate base frequency point according to the unvoiced/voiced sound class corresponding to each speech frame and a preset fundamental frequency screening condition as follows: determining a fundamental frequency candidate interval according to the distribution characteristics of the candidate fundamental frequency points of each voice frame; correcting the fundamental frequency candidate interval based on the voiced and unvoiced class of each voice frame to obtain a corrected fundamental frequency candidate interval; and replacing the target candidate base frequency point which is not in the corrected base frequency candidate interval with other candidate base frequency points in the voice frame corresponding to the target candidate base frequency point to obtain the corrected candidate base frequency point.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement the fundamental frequency extraction method as provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the fundamental frequency extraction method provided in the first aspect.
According to the fundamental frequency extraction method and device in the embodiments of the present application, candidate base frequency points of each speech frame in a speech signal to be processed are extracted based on acoustic features of the speech signal to be processed; the speech frames are subjected to unvoiced/voiced classification to obtain the unvoiced/voiced category corresponding to each speech frame; and the candidate base frequency points are corrected based on the unvoiced/voiced category corresponding to each speech frame and a preset fundamental frequency screening condition, and a fundamental frequency sequence of the speech signal to be processed is determined from the corrected candidate base frequency points by a dynamic programming algorithm. In this way, unreasonable points among the candidate base frequency points can be effectively filtered out, the rate of frequency-doubling and frequency-halving errors is reduced, and the accuracy of fundamental frequency extraction is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which embodiments of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a fundamental frequency extraction method according to the present application;
FIG. 3 is a flow diagram of another embodiment of a fundamental frequency extraction method according to the present application;
fig. 4 is a schematic structural diagram of an embodiment of the fundamental frequency extracting apparatus of the present application;
FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture to which the fundamental frequency extraction method or fundamental frequency extraction apparatus of the present application can be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user 110 may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various voice interaction applications can be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having an audio input interface and an audio output interface and supporting internet access, including but not limited to smartphones, tablet computers, smartwatches, e-book readers, and the like.
The server 105 may be a voice server providing support for voice services. The server 105 may receive voice signals sent by the terminal devices 101, 102, and 103 and extract acoustic features such as the fundamental frequency from these signals. The server 105 may also receive voice interaction requests sent by the terminal devices 101, 102, and 103, parse the requests, find corresponding text data according to the parsing result, perform speech synthesis based on acoustic features such as the fundamental frequency, and return the generated voice response signals to the terminal devices 101, 102, and 103. The terminal devices 101, 102, 103 may output the voice response signals.
It should be noted that the fundamental frequency extracting method provided in the embodiment of the present application may be executed by the terminal device 101, 102, 103 or the server 105, and accordingly, the fundamental frequency extracting device may be disposed in the terminal device 101, 102, 103 or the server 105.
It should be understood that the number of terminal devices, networks, servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a fundamental frequency extraction method according to the present application is shown. The fundamental frequency extraction method comprises the following steps:
step 201, extracting a candidate base frequency point of each voice frame in the voice signal to be processed based on the acoustic characteristics of the voice signal to be processed.
In this embodiment, an executing body (for example, a terminal device or a server shown in fig. 1) of the fundamental frequency extraction method may acquire a speech signal to be processed. The speech signal to be processed may be a signal consisting of a plurality of speech frames that are chronologically consecutive. In practice, the speech signal to be processed may be a natural speech signal, and may be a collected speech signal generated when a person speaks.
Acoustic feature extraction may be performed on the speech signal to be processed; specifically, various acoustic features may be computed from the time-domain signal and/or the frequency-domain signal of the speech signal to be processed. Here, the acoustic features may include, but are not limited to, at least one of: an autocorrelation function, a cross-correlation function, and a short-time average amplitude difference function. They may also include at least one of the following characteristics of the autocorrelation function and/or the cross-correlation function: zero-crossing rate, energy, and peak rate. The acoustic features may further include frequency-domain features, such as cepstral coefficients.
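By way of illustration only, the following Python sketch computes the kind of frame-level features mentioned above (short-time energy, zero-crossing rate, and the normalized autocorrelation of each frame); the frame length, hop size, and function names are assumptions of this sketch, not values or interfaces prescribed by the present application:
```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (assumed 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def frame_features(frames):
    """Per-frame short-time energy, zero-crossing rate, and normalized autocorrelation."""
    energy = np.sum(frames ** 2, axis=1)                                   # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)    # zero-crossing rate
    autocorr = np.stack([
        np.correlate(f, f, mode="full")[len(f) - 1:] / (np.dot(f, f) + 1e-12)
        for f in frames
    ])                                                                     # lag 0 .. frame_len-1
    return energy, zcr, autocorr

# Usage on a synthetic 200 Hz tone sampled at 16 kHz (illustrative data only):
fs = 16000
x = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
energy, zcr, autocorr = frame_features(frame_signal(x))
print(energy.shape, zcr.shape, autocorr.shape)
```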
Base frequency points may be determined as candidate base frequency points based on statistics of the zero-crossing rate and the peak rate of the autocorrelation function or the cross-correlation function per unit time, or candidate base frequency points may be extracted based on the consistency between the short-time average amplitude difference function and the pitch period; wavelet transforms, frequency-domain filters, and the like may also be employed to extract candidate base frequency points based on the frequency-domain features.
In some optional implementation manners of this embodiment, based on the acoustic features of the speech signal to be processed, the candidate fundamental frequency point of each speech frame may be extracted as follows: firstly, down-sampling a voice signal to be processed; calculating a peak point of a cross-correlation function based on acoustic characteristics for a voice frame in the voice signal to be processed after down-sampling, and determining a candidate base frequency point corresponding to the voice signal to be processed after down-sampling according to the peak point; and finally mapping the candidate base frequency points corresponding to the voice signals to be processed after down-sampling to the voice signals to be processed to obtain the candidate base frequency points of each voice frame in the voice signals to be processed.
For each frame in the down-sampled speech signal to be processed, the relative amplitude of the current frame and its relative position in the down-sampled speech signal are determined, and the correlation between the current frame and the previous frame is calculated to obtain a cross-correlation function. Several peak points of the cross-correlation function are taken as the fundamental frequency candidate values of each frame. The quality of each candidate value may then be evaluated; for example, it may be determined whether a candidate value differs from the other candidate values by more than a certain range, and if so, its quality is judged to be poor. The fundamental frequency candidate values may be screened according to the quality evaluation result, for example by deleting candidate values of poor quality. The unvoiced/voiced category of each frame may also be roughly estimated from the acoustic features, and the fundamental frequency candidate values of frames corresponding to unvoiced sound may be deleted.
Optionally, the candidate value of the fundamental frequency may be further filtered according to the continuity of the fundamental frequency. Usually, a syllable or a phoneme contains a plurality of speech frames, and the fundamental frequency of consecutive speech frames does not fluctuate greatly, i.e. the fundamental frequency has a certain continuity in time sequence. The fundamental frequency candidate values which fluctuate beyond a preset range in a preset time period can be removed, and the influence of the abnormal candidate values on the fundamental frequency extraction accuracy is avoided.
And then, mapping the candidate base frequency points corresponding to the voice signals to be processed after down-sampling to the voice signals to be processed with the original sampling rate to obtain the candidate base frequency points of each voice frame in the voice signals to be processed with the original sampling rate. One candidate base frequency point of the voice signal to be processed after the down-sampling corresponds to an interval in the voice signal to be processed of the original sampling rate, and one point in the interval can be selected as a base frequency point of the voice signal to be processed after the down-sampling and mapped to the voice signal to be processed of the original sampling rate, namely, the candidate base frequency point of the voice signal to be processed of the original sampling rate.
Because the cross-correlation operation involves relatively time-consuming multiplications, down-sampling the speech signal to be processed, performing the cross-correlation on the down-sampled signal, and then mapping the determined base frequency points back to the speech signal at the original sampling rate reduces the amount of computation and speeds up the calculation.
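A minimal coarse-to-fine sketch of this idea is given below. It is illustrative only: it uses per-frame normalized autocorrelation peaks as a stand-in for the cross-correlation between adjacent frames described above, and the decimation factor, frame parameters, and peak-picking thresholds are assumptions rather than values specified by the present application. Because candidates are kept as frequency values, they carry over to the original-rate signal without further conversion.
```python
import numpy as np
from scipy.signal import decimate, find_peaks

def candidate_f0_points(x, fs, factor=4, frame_len_s=0.04, hop_s=0.01,
                        f0_min=50.0, f0_max=600.0, n_candidates=5):
    """Illustrative sketch: pick candidate F0 values per frame from correlation
    peaks of a down-sampled copy of the signal."""
    x_ds = decimate(x, factor)                        # down-sample to fs / factor
    fs_ds = fs / factor
    frame_len, hop = int(frame_len_s * fs_ds), int(hop_s * fs_ds)
    min_lag, max_lag = int(fs_ds / f0_max), int(fs_ds / f0_min)
    candidates = []
    for start in range(0, len(x_ds) - frame_len, hop):
        f = x_ds[start:start + frame_len]
        ac = np.correlate(f, f, mode="full")[frame_len - 1:]
        ac = ac / (ac[0] + 1e-12)                     # normalize so lag 0 == 1
        peaks, props = find_peaks(ac[min_lag:max_lag], height=0.1)
        lags = peaks + min_lag
        order = np.argsort(props["peak_heights"])[::-1][:n_candidates]
        # A lag of L samples at rate fs_ds corresponds to F0 = fs_ds / L; the
        # frequency value is the same for the original-rate signal.
        candidates.append(fs_ds / lags[order] if len(lags) else np.array([]))
    return candidates

# Usage on a synthetic 150 Hz tone sampled at 16 kHz (illustrative data only):
fs = 16000
x = np.sin(2 * np.pi * 150 * np.arange(fs) / fs)
cands = candidate_f0_points(x, fs)
print(cands[10])   # strongest peak near 150 Hz; lower-ranked peaks may be sub-harmonics
```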
Step 202, performing unvoiced and voiced classification on the voice frames to obtain unvoiced and voiced categories corresponding to the voice frames.
In this embodiment, the voiced and unvoiced class of each speech frame in the speech signal to be processed may be determined. Here, the unvoiced/voiced category is used to represent whether the syllable or phoneme corresponding to the speech frame is unvoiced or voiced.
The unvoiced/voiced category is determined by whether the vocal cords vibrate during pronunciation. A sound produced while the vocal cords vibrate is voiced; voiced pronunciation has periodic characteristics and therefore has a fundamental frequency. A sound produced while the vocal cords do not vibrate is unvoiced; unvoiced pronunciation has no periodic characteristics and therefore has no fundamental frequency.
In this embodiment, the unvoiced/voiced category of the speech frames in the speech signal to be processed may be determined by various methods. In some alternative implementations, unvoiced and voiced speech may be distinguished based on parameters such as the short-time energy and zero-crossing rate of the speech signal in the time domain. Because the short-time energy of unvoiced sound is smaller and that of voiced sound is larger, unvoiced and voiced sound can be distinguished by setting a short-time energy threshold. Specifically, it may be determined whether the short-time energy of a speech frame exceeds the threshold; if it does, the speech frame is determined to be voiced, otherwise it is determined to be unvoiced. The zero-crossing rate of unvoiced sound is high while that of voiced sound is low, so unvoiced and voiced sound can similarly be distinguished by setting a threshold on the zero-crossing rate.
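A minimal sketch of such a threshold-based decision is shown below; the threshold values are illustrative assumptions and would in practice be tuned to the recording conditions:
```python
import numpy as np

def vuv_by_threshold(frames, energy_thresh=0.1, zcr_thresh=0.3):
    """Label each frame voiced (True) or unvoiced (False) from short-time energy
    and zero-crossing rate. Thresholds are assumed values for this sketch."""
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Voiced speech: relatively high energy and relatively low zero-crossing rate.
    return (energy > energy_thresh) & (zcr < zcr_thresh)
```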
In other alternative implementations, unvoiced and voiced sound may be distinguished based on the short-time energy variation of the autocorrelation function: frames whose autocorrelation-based short-time energy changes little are unvoiced, while frames whose short-time energy changes greatly are voiced. The short-time average amplitude difference function drops rapidly near the pitch period for voiced sound, whereas for unvoiced sound it does not drop rapidly; unvoiced and voiced sound can therefore be identified according to whether the short-time average amplitude difference function drops rapidly within a short time.
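The short-time average amplitude difference function itself can be written in a few lines. The sketch below is illustrative only; it shows that for a periodic (voiced) frame the function dips sharply at the pitch period, while for a noise-like (unvoiced) frame it has no comparably deep dip:
```python
import numpy as np

def amdf(frame, max_lag=None):
    """Short-time average amplitude difference function d(k) = mean |x[n] - x[n+k]|."""
    max_lag = max_lag or len(frame) // 2
    return np.array([np.mean(np.abs(frame[:-k] - frame[k:])) for k in range(1, max_lag)])

# Illustrative data: a 200 Hz tone (period = 80 samples at 16 kHz) vs. white noise.
fs = 16000
t = np.arange(640) / fs
voiced = np.sin(2 * np.pi * 200 * t)
unvoiced = np.random.default_rng(0).standard_normal(640)
print(np.argmin(amdf(voiced, max_lag=100)) + 1)      # -> 80, the pitch period in samples
print(amdf(unvoiced).min() / amdf(unvoiced).mean())  # no comparably deep dip for noise
```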
In some optional implementation manners of this embodiment, the voiced and unvoiced speech frames in the speech signal to be processed may be classified according to the following manners to obtain unvoiced and voiced categories corresponding to the speech frames: and inputting the extracted acoustic features of the voice signal to be processed into the trained unvoiced and voiced sound classification model to obtain unvoiced and voiced sound classification results corresponding to each voice frame in the voice signal to be processed. That is, a training set may be constructed in advance to train an unvoiced/voiced sound classification model, and the distribution of the acoustic features of the speech signal to the unvoiced/voiced sound classification of the corresponding speech frame may be learned. And inputting the extracted acoustic features into the trained unvoiced and voiced sound classification model to obtain the unvoiced and voiced sound classification of each voice frame in the voice signal to be processed.
And 203, correcting the candidate base frequency points based on the unvoiced/voiced sound class corresponding to each voice frame and a preset base frequency screening condition, and determining a base frequency sequence of the voice signal to be processed from the corrected candidate base frequency points by adopting a dynamic programming algorithm.
The preset fundamental frequency screening condition may include an upper and lower boundary constraint on the fundamental frequency and/or a continuity constraint on the fundamental frequency. The upper and lower boundaries of the fundamental frequency may be determined based on statistics of the fundamental frequency of human utterances; for example, the fundamental frequency of normal human speech lies roughly in the range of 50 Hz to 600 Hz, and that of adults roughly in the range of 50 Hz to 400 Hz. According to this screening rule, the fundamental frequency interval may be determined as [50 Hz, 600 Hz] or [50 Hz, 400 Hz].
The continuity of the fundamental frequency refers to the continuity between the fundamental frequencies of a plurality of continuous speech frames, that is, a curve formed by connecting the fundamental frequency points of each speech frame in a speech signal generally has no abrupt peak point. Continuity constraint conditions can be set in advance according to the continuity of the fundamental frequency, and the candidate fundamental frequency points can be removed when the candidate fundamental frequency points do not meet the preset conditions.
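One possible way to express such screening conditions in code is sketched below; the 50-600 Hz range follows the statistics mentioned above, while the 20% jump tolerance and the function name are illustrative assumptions:
```python
def passes_screening(candidate_hz, prev_f0_hz=None,
                     f0_min=50.0, f0_max=600.0, max_jump=0.2):
    """Illustrative sketch: range constraint plus a simple continuity constraint
    (reject candidates that jump by more than max_jump from the previous frame)."""
    if not (f0_min <= candidate_hz <= f0_max):
        return False
    if prev_f0_hz is not None and prev_f0_hz > 0:
        if abs(candidate_hz - prev_f0_hz) / prev_f0_hz > max_jump:
            return False
    return True

print(passes_screening(220.0, prev_f0_hz=210.0))   # True: small, plausible change
print(passes_screening(440.0, prev_f0_hz=210.0))   # False: looks like a doubled frequency
```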
In this embodiment, a fundamental frequency screening condition may be preset based on the characteristic of the fundamental frequency, and the candidate fundamental frequency point of each speech frame obtained in step 201 may be corrected based on the fundamental frequency screening condition and the voiced and unvoiced sound category corresponding to each speech frame.
Specifically, a plurality of candidate base frequency points can be extracted for each speech frame through step 201. The interval in which the fundamental frequency may lie can be determined according to the fundamental frequency screening condition, and a filter for filtering out base frequency points corresponding to unvoiced sound can be constructed. According to the unvoiced/voiced categories obtained in step 202, the candidate base frequency points of voiced frames are obtained by filtering, and candidate base frequency points that do not lie in the interval determined by the fundamental frequency screening condition are deleted, so as to obtain the corrected candidate base frequency points. Optionally, since the fundamental frequency characteristics of different speech frames differ, the interval in which the fundamental frequency of each speech frame may lie can be determined according to the fundamental frequency screening rule, and the candidate base frequency points of unvoiced speech frames are removed using the unvoiced/voiced categories obtained in step 202, so as to obtain the corrected candidate base frequency points of each voiced speech frame in the speech signal to be processed.
As an example, suppose the candidate base frequency points of a speech frame are 50Hz, 70Hz, 75Hz, 85Hz, and 150Hz. If the fundamental frequency interval of the speech frame is determined to be [70Hz, 100Hz] according to fundamental frequency screening conditions such as fundamental frequency continuity, and the speech frame is a voiced frame, then after passing through the filter that removes base frequency points corresponding to unvoiced sound, the candidate base frequency points 50Hz and 150Hz, which are not in the fundamental frequency interval [70Hz, 100Hz], are deleted, and the candidate base frequency points 70Hz, 75Hz, and 85Hz are retained as the corrected candidate base frequency points of the speech frame. In this way, through the preset fundamental frequency screening rule and the filter based on the unvoiced/voiced classification result, the candidate base frequency points of unvoiced frames and abrupt candidate base frequency points produced by other interfering factors (such as environmental noise) can be filtered out, and the rate of frequency-halving and frequency-doubling errors can be reduced.
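The example above can be expressed as a small sketch (illustrative only; function and parameter names are assumptions):
```python
def correct_candidates(frame_candidates_hz, is_voiced, interval):
    """Keep only candidates of a voiced frame that fall inside the allowed
    fundamental frequency interval; unvoiced frames keep no candidates."""
    if not is_voiced:
        return []                       # unvoiced frames carry no fundamental frequency
    lo, hi = interval
    return [c for c in frame_candidates_hz if lo <= c <= hi]

# The example from the text: candidates 50, 70, 75, 85, 150 Hz, a voiced frame,
# interval [70 Hz, 100 Hz] -> 70, 75 and 85 Hz survive.
print(correct_candidates([50, 70, 75, 85, 150], True, (70, 100)))
```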
After the corrected candidate base frequency points of each speech frame are obtained, a dynamic programming algorithm can be adopted to calculate the connection cost among the candidate base frequency points and find the path with the minimum cost. The sequence formed by the candidate base frequency points on this path is the extracted fundamental frequency sequence of the speech signal to be processed.
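A compact, Viterbi-style sketch of this dynamic programming step is given below. It is illustrative only: the connection cost used here is simply the relative frequency jump between candidates of consecutive frames, whereas the present application only requires some connection cost and a minimum-cost path:
```python
import numpy as np

def best_f0_path(candidates_per_frame):
    """Illustrative sketch: pick one candidate per frame so that the summed
    frame-to-frame transition cost is minimal (dynamic programming)."""
    def transition_cost(a, b):
        return abs(a - b) / max(a, b)            # penalize large relative jumps

    frames = [c for c in candidates_per_frame if c]  # skip frames with no candidates
    if not frames:
        return []
    cost = [np.zeros(len(frames[0]))]            # cost[i][j]: best cost ending at candidate j
    back = []                                    # back[i-1][j]: best predecessor index
    for i in range(1, len(frames)):
        prev, cur = cost[-1], frames[i]
        c = np.empty(len(cur))
        b = np.empty(len(cur), dtype=int)
        for j, f0 in enumerate(cur):
            steps = prev + np.array([transition_cost(p, f0) for p in frames[i - 1]])
            b[j] = int(np.argmin(steps))
            c[j] = steps[b[j]]
        cost.append(c)
        back.append(b)
    # Trace the minimum-cost path back from the last frame.
    j = int(np.argmin(cost[-1]))
    path = [frames[-1][j]]
    for i in range(len(back) - 1, -1, -1):
        j = int(back[i][j])
        path.append(frames[i][j])
    return path[::-1]

print(best_f0_path([[100, 200], [105, 210], [110, 55]]))  # -> [100, 105, 110]
```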
According to the fundamental frequency extraction method described above, the candidate base frequency points of each speech frame in the speech signal to be processed are extracted based on the acoustic features of the speech signal to be processed, the speech frames are subjected to unvoiced/voiced classification to obtain the unvoiced/voiced category corresponding to each speech frame, the candidate base frequency points are corrected based on the unvoiced/voiced category corresponding to each speech frame and preset fundamental frequency screening conditions, and the fundamental frequency sequence of the speech signal to be processed is determined from the corrected candidate base frequency points by a dynamic programming algorithm. This effectively filters out half-frequency, frequency-doubled, and other spurious fundamental frequency points and can improve the accuracy of fundamental frequency extraction.
In some embodiments, the fundamental frequency extraction method may further include a step of training the unvoiced/voiced classification model based on sample speech signals labeled with the unvoiced/voiced category information of each contained speech frame. Sample speech signals labeled with the unvoiced/voiced category information of each contained speech frame may be obtained to construct a sample speech signal set. Then, acoustic features such as the zero-crossing rate of the autocorrelation function/cross-correlation function and the relative amplitude and relative position of each speech frame are extracted from the sample speech signals, and the extracted acoustic features are input into the unvoiced/voiced classification model to be trained to obtain its unvoiced/voiced classification result for the sample speech signals. The classification result is compared with the corresponding unvoiced/voiced category labels of the sample speech signals. If the difference between them does not satisfy a preset convergence condition, the parameters of the unvoiced/voiced classification model to be trained are adjusted according to this difference, the sample speech signals are classified again using the model with the adjusted parameters, and it is again determined whether the difference between the new classification result and the labels satisfies the convergence condition. The parameters are iteratively adjusted in this way until the difference between the classification result of the adjusted model and the unvoiced/voiced category labels of the sample speech signals satisfies the preset convergence condition, at which point the trained unvoiced/voiced classification model is obtained.
Alternatively, the unvoiced/voiced sound category information of the speech frame included in the sample speech signal may be generated by labeling, by a labeling person, the unvoiced/voiced sound category of each syllable or phoneme of the sample speech signal, splitting the sample speech signal into speech frames corresponding to each syllable or phoneme, and labeling the unvoiced/voiced sound category of the syllable or phoneme as the unvoiced/voiced sound category information of the corresponding speech frame.
Alternatively, the trained unvoiced/voiced classification model may be a binary classification model based on logistic regression. Classifying the unvoiced/voiced category of each speech frame in the speech signal to be processed with a classification model trained by a machine learning method can achieve high classification accuracy.
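A compact training sketch under this logistic-regression interpretation is shown below. It is illustrative only: scikit-learn is used merely for convenience, the per-frame features (short-time energy and zero-crossing rate) and the synthetic labeled data are assumptions of the sketch, and any binary classifier trained by a machine learning method would fit the description equally well:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def frame_features(frames):
    """Per-frame short-time energy and zero-crossing rate as a 2-D feature matrix."""
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr], axis=1)

# Toy "sample speech signal": voiced frames as sinusoids, unvoiced frames as noise,
# labelled 1 (voiced) / 0 (unvoiced) to stand in for the annotation described above.
rng = np.random.default_rng(0)
t = np.arange(400) / 16000.0
voiced = np.stack([np.sin(2 * np.pi * f * t) for f in rng.uniform(80, 300, 200)])
unvoiced = 0.3 * rng.standard_normal((200, 400))
X = frame_features(np.concatenate([voiced, unvoiced]))
y = np.concatenate([np.ones(200), np.zeros(200)])

model = LogisticRegression().fit(X, y)          # train the unvoiced/voiced classifier
print(model.score(X, y))                        # training accuracy of the toy model
```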
With continued reference to fig. 3, a flow chart of another embodiment of a fundamental frequency extraction method according to the present application is shown. As shown in fig. 3, the process 300 of the fundamental frequency extracting method of the present embodiment includes the following steps:
step 301, extracting a candidate base frequency point of each voice frame in the voice signal to be processed based on the acoustic characteristics of the voice signal to be processed.
In this embodiment, the execution subject of the fundamental frequency extraction method (for example, the terminal device or the server shown in fig. 1) may acquire a speech signal to be processed and perform acoustic feature extraction on it; specifically, various acoustic features may be computed and analyzed from the time-domain signal and/or the frequency-domain signal of the speech signal to be processed. Here, the acoustic features may include, but are not limited to, at least one of: an autocorrelation function, a cross-correlation function, and a short-time average amplitude difference function. They may also include at least one of the following characteristics of the autocorrelation function and/or the cross-correlation function: zero-crossing rate, energy, and peak rate. The acoustic features may also include frequency-domain features, such as cepstral coefficients.
Base frequency points may be determined as candidate base frequency points based on statistics of the zero-crossing rate and peak rate per unit time, or candidate base frequency points may be extracted based on the consistency between the autocorrelation function or short-time average amplitude difference function and the pitch period; wavelet transforms, frequency-domain filters, and the like may also be employed to extract candidate base frequency points based on the frequency-domain features.
In some optional implementation manners of this embodiment, based on the acoustic features of the speech signal to be processed, the candidate fundamental frequency point of each speech frame may be extracted as follows: firstly, down-sampling a voice signal to be processed; then, calculating a peak point of a cross-correlation function based on acoustic characteristics for a voice frame in the voice signal to be processed after down-sampling, and determining a candidate base frequency point corresponding to the voice signal to be processed after down-sampling according to the peak point; and finally mapping the candidate base frequency points corresponding to the voice signals to be processed after down-sampling to the voice signals to be processed to obtain the candidate base frequency points of each voice frame in the voice signals to be processed.
Step 302, performing unvoiced and voiced classification on the voice frames to obtain unvoiced and voiced categories corresponding to the voice frames.
In this embodiment, a plurality of methods may be employed to determine the voiced and unvoiced category of each speech frame in the speech signal to be processed. Here, the unvoiced/voiced category is used to represent whether the corresponding pronunciation of the speech frame is unvoiced or voiced.
In some alternative implementations, unvoiced and voiced speech may be resolved based on parameters such as short-term energy, zero-crossing rate, etc. of the speech signal in the time domain. Unvoiced and voiced sounds may be distinguished by setting a short-time energy threshold and/or a zero-crossing rate threshold.
In other alternative implementations, unvoiced and voiced sound may be distinguished based on the short-time energy variation of the autocorrelation function and/or the short-time average amplitude difference function. Frames whose autocorrelation-based short-time energy changes little are unvoiced, while frames whose short-time energy changes greatly are voiced. The short-time average amplitude difference function drops rapidly near the pitch period for voiced sound, whereas for unvoiced sound it does not; unvoiced and voiced sound can therefore be identified according to whether the short-time average amplitude difference function drops rapidly within a short time.
In some optional implementation manners, the voiced and unvoiced speech frames in the speech signal to be processed may be classified in the following manner, so as to obtain an unvoiced and voiced category corresponding to each speech frame: and inputting the extracted acoustic features of the voice signal to be processed into the trained unvoiced and voiced sound classification model to obtain unvoiced and voiced sound classification results corresponding to each voice frame in the voice signal to be processed.
Optionally, before the step 302, the flow 300 of the fundamental frequency extraction method may further include: and training to obtain a trained unvoiced and voiced sound classification model based on the sample voice signals marked with the unvoiced and voiced sound classification information of each voice frame. Specifically, the sample voice signal and the labeling information of the voiced and unvoiced classification of each voice frame in the sample voice signal can be obtained, and the parameters of the voiced and unvoiced classification model to be trained are adjusted by adopting a machine learning algorithm, so that the voiced and unvoiced classification result of each voice frame of the sample voice signal by the voiced and unvoiced classification model to be trained tends to be consistent with the labeling information.
Step 301 and step 302 of this embodiment are respectively the same as step 201 and step 202 of the foregoing embodiment, and specific implementation manners of step 301 and step 302 may refer to specific implementation manners of step 201 and step 202 of the foregoing embodiment, which are not described herein again.
Step 303, determining a fundamental frequency candidate interval according to the distribution characteristics of the candidate fundamental frequency points of each speech frame.
The fundamental frequency generally has the following distribution: the candidate base frequencies of each speech frame are usually in a small range, and the distribution of the candidate base frequencies of one speech frame is usually more concentrated and less discrete. If the difference value of two candidate base frequency points of one voice frame exceeds a set threshold value, at least one candidate base frequency point can be determined to be an unreasonable base frequency point.
Specifically, for each speech frame, a plurality of fundamental frequency intervals can be obtained according to the distribution characteristics of its candidate fundamental frequency points, where the distance between the end points of adjacent fundamental frequency intervals is greater than a preset threshold. For example, if the candidate base frequency points of a speech frame are 80Hz, 82Hz, 85Hz, 118Hz, 120Hz, 125Hz, 151Hz, 153Hz, 156Hz, and 158Hz, and the preset threshold on the gap between interval end points is 20Hz, then three fundamental frequency intervals are obtained: [80Hz, 85Hz], [118Hz, 125Hz], and [151Hz, 158Hz]. Therefore, when the candidate base frequency points include half-frequency or frequency-doubled points, dividing the candidates into fundamental frequency intervals separates the half-frequency points, the frequency-doubled points, and the true base frequency points, reducing the rate of frequency-halving and frequency-doubling errors.
One of the plurality of fundamental frequency intervals of each speech frame may then be selected as the fundamental frequency candidate interval for that speech frame. For example, the interval in which the candidate base frequency point with the maximum autocorrelation peak value is located may be selected, or the interval containing the largest number of autocorrelation peak points may be selected as the fundamental frequency candidate interval of the speech frame.
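A small sketch of this interval grouping is given below; the 20 Hz gap threshold matches the example above, and scoring the "best" interval simply by the number of candidates it contains is an illustrative simplification of the autocorrelation-peak criteria just described:
```python
def split_into_intervals(candidates_hz, gap=20.0):
    """Group sorted candidate F0 values into intervals separated by gaps > gap Hz."""
    values = sorted(candidates_hz)
    intervals, current = [], [values[0]]
    for v in values[1:]:
        if v - current[-1] > gap:
            intervals.append(current)
            current = [v]
        else:
            current.append(v)
    intervals.append(current)
    return intervals

def candidate_interval(candidates_hz, gap=20.0):
    """Pick the most populated interval as the frame's fundamental frequency candidate interval."""
    best = max(split_into_intervals(candidates_hz, gap), key=len)
    return (best[0], best[-1])

# The example from the text: three intervals [80, 85], [118, 125], [151, 158] Hz.
cands = [80, 82, 85, 118, 120, 125, 151, 153, 156, 158]
print(split_into_intervals(cands))
print(candidate_interval(cands))   # -> (151, 158), the interval with the most candidates
```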
And 304, correcting the fundamental frequency candidate interval based on the voiced and unvoiced category of each voice frame to obtain a corrected fundamental frequency candidate interval.
In this embodiment, the candidate base frequency points of the speech frames whose unvoiced/voiced category is unvoiced, as determined in step 302, may be deleted to correct the fundamental frequency candidate interval. For example, if the fundamental frequency candidate interval is [50Hz, 100Hz] and a candidate base frequency point of an unvoiced speech frame is 85Hz, the corrected fundamental frequency candidate intervals are [50Hz, 85Hz) and (85Hz, 100Hz]. In this way, the corrected fundamental frequency candidate interval incorporates the unvoiced/voiced information of the speech frames, which reduces the error rate when searching for base frequency points within the interval.
And 305, replacing the target candidate base frequency point which is not in the corrected base frequency candidate interval with other candidate base frequency points in the voice frame corresponding to the target candidate base frequency point to obtain the corrected candidate base frequency point, and determining the base frequency sequence of the voice signal to be processed from the corrected candidate base frequency point by adopting a dynamic programming algorithm.
The target candidate base frequency points of each speech frame that are not in the corrected fundamental frequency candidate interval can be removed, and other candidate base frequency points of the corresponding speech frame are supplemented in their place. Optionally, if a removed target candidate base frequency point was the optimal base frequency point selected according to the distribution characteristics of the base frequency points in its speech frame, a suboptimal base frequency point of that speech frame may be supplemented. As an example, if the removed target candidate base frequency point is the base frequency point with the maximum autocorrelation peak value in the corresponding speech frame, the base frequency point ranked second by autocorrelation peak value in that speech frame may be supplemented. Alternatively, if the removed target candidate base frequency point is a base frequency point in the interval containing the largest number of autocorrelation peak points in the corresponding speech frame, a base frequency point in the interval containing the second largest number of autocorrelation peak points may be supplemented. That is, the target candidate base frequency points of each speech frame that are not in the corrected fundamental frequency candidate interval may be replaced by other base frequency points, so as to obtain the corrected candidate base frequency points.
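As a small illustrative sketch of this replacement step, the candidates of a frame are assumed here to be ranked best-first (for example by their autocorrelation peak values); the names and the ranking are assumptions of the sketch:
```python
def replace_out_of_interval(ranked_candidates_hz, interval):
    """If the best-ranked candidate falls outside the corrected interval, drop it
    and fall back to the next-ranked candidate that lies inside the interval."""
    lo, hi = interval
    for c in ranked_candidates_hz:
        if lo <= c <= hi:
            return c
    return None   # no candidate of this frame fits the corrected interval

# Best candidate 160 Hz lies outside [70 Hz, 100 Hz]; the suboptimal 82 Hz replaces it.
print(replace_out_of_interval([160.0, 82.0, 75.0], (70.0, 100.0)))   # -> 82.0
```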
And then, calculating the connection cost among the candidate base frequency points by adopting a dynamic programming algorithm, and finding out a path with the minimum cost. The sequence formed by the candidate base frequency points passed by the path is the extracted base frequency sequence of the voice signal to be processed.
As can be seen from fig. 3, in the fundamental frequency extraction method of this embodiment, the fundamental frequency candidate interval is determined according to the distribution characteristics of the candidate base frequency points, the interval is then corrected according to the unvoiced/voiced classification result, and candidate base frequency points that are not in the corrected interval are removed and replaced by other candidate base frequency points. Unreasonable candidate base frequency points can thus be further removed, while other candidates are supplemented for the dynamic programming step, so that the reliability of the dynamic programming result is not affected by a sharp decrease in the number of corrected candidate base frequency points.
With further reference to fig. 4, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a fundamental frequency extracting apparatus, which corresponds to the embodiments of the methods shown in fig. 2 and fig. 3, and which can be applied in various electronic devices.
As shown in fig. 4, the fundamental frequency extracting apparatus 400 of the present embodiment includes: an extraction unit 401, a classification unit 402, and a determination unit 403. The extraction unit 401 is configured to extract a candidate base frequency point of each speech frame in the speech signal to be processed based on the acoustic features of the speech signal to be processed; the classifying unit 402 is configured to perform unvoiced and voiced classification on the voice frames to obtain unvoiced and voiced categories corresponding to the voice frames; the determining unit 403 is configured to modify the candidate base frequency points based on the unvoiced/voiced sound category corresponding to each speech frame and a preset base frequency screening condition, and determine a base frequency sequence of the speech signal to be processed from the modified candidate base frequency points by using a dynamic programming algorithm.
In some embodiments, the classifying unit 402 may be further configured to perform unvoiced-voiced classification on the speech frames to obtain an unvoiced-voiced class corresponding to each speech frame as follows: and inputting the extracted acoustic features of the voice signal to be processed into the trained unvoiced and voiced sound classification model to obtain unvoiced and voiced sound classification results corresponding to each voice frame in the voice signal to be processed.
In some embodiments, the apparatus 400 may further include: and the training unit is configured to train and obtain a trained unvoiced and voiced sound classification model based on the sample voice signal labeled with the unvoiced and voiced sound classification information of each voice frame.
In some embodiments, the extracting unit 401 may be further configured to extract the candidate base frequency point of each speech frame in the speech signal to be processed according to the following manner, based on the acoustic features of the speech signal to be processed: down-sampling the voice signal to be processed; calculating a peak point of a cross-correlation function based on acoustic characteristics for a voice frame in the voice signal to be processed after down-sampling, and determining a candidate base frequency point corresponding to the voice signal to be processed after down-sampling according to the peak point; and mapping the candidate base frequency point corresponding to the voice signal to be processed after down-sampling to the voice signal to be processed to obtain the candidate base frequency point of each voice frame in the voice signal to be processed.
In some embodiments, the determining unit 403 may be further configured to modify the candidate base frequency point according to the unvoiced/voiced sound class corresponding to each speech frame and a preset base frequency screening condition as follows: determining a fundamental frequency candidate interval according to the distribution characteristics of the candidate fundamental frequency points of each voice frame; correcting the fundamental frequency candidate interval based on the voiced and unvoiced class of each voice frame to obtain a corrected fundamental frequency candidate interval; and replacing the target candidate base frequency point which is not in the corrected base frequency candidate interval with other candidate base frequency points in the voice frame corresponding to the target candidate base frequency point to obtain the corrected candidate base frequency point.
It should be understood that the elements recited in apparatus 400 correspond to various steps in the methods described with reference to fig. 2 and 3. Thus, the operations and features described above for the method are equally applicable to the apparatus 400 and the units included therein, and are not described in detail here.
The fundamental frequency extracting apparatus 400 of the above embodiment of the present application extracts candidate base frequency points of each speech frame in a speech signal to be processed based on acoustic features of the speech signal to be processed; performs unvoiced/voiced classification on the speech frames to obtain the unvoiced/voiced category corresponding to each speech frame; and corrects the candidate base frequency points based on the unvoiced/voiced category corresponding to each speech frame and a preset fundamental frequency screening condition, and determines a fundamental frequency sequence of the speech signal to be processed from the corrected candidate base frequency points by a dynamic programming algorithm. In this way, unreasonable points among the candidate base frequency points can be effectively filtered out, the rate of frequency-doubling and frequency-halving errors is reduced, and the accuracy of fundamental frequency extraction is improved.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. When executed by the Central Processing Unit (CPU) 501, the computer program performs the above-described functions defined in the method of the present application.

It should be noted that the computer readable medium of the present application may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

In the present application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium other than a computer readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, fiber optic cable, RF, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising an extraction unit, a classification unit, and a determining unit. In some cases, the names of these units do not constitute a limitation on the units themselves; for example, the extraction unit may also be described as "a unit that extracts candidate base frequency points of each speech frame in a speech signal to be processed based on acoustic features of the speech signal to be processed".
As another aspect, the present application also provides a computer readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: extract candidate base frequency points of each speech frame in a speech signal to be processed based on the acoustic features of the speech signal to be processed; perform unvoiced/voiced classification on the speech frames to obtain the unvoiced/voiced class corresponding to each speech frame; and correct the candidate base frequency points based on the unvoiced/voiced classes corresponding to the speech frames and a preset base frequency screening condition, and determine a fundamental frequency sequence of the speech signal to be processed from the corrected candidate base frequency points with a dynamic programming algorithm.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A fundamental frequency extraction method, comprising:
extracting candidate base frequency points of each voice frame in the voice signal to be processed based on the acoustic characteristics of the voice signal to be processed;
carrying out unvoiced and voiced sound classification on the voice frames to obtain unvoiced and voiced sound classes corresponding to the voice frames;
modifying the candidate base frequency points based on the unvoiced and voiced sound classes corresponding to the voice frames and preset base frequency screening conditions, and determining a base frequency sequence of the voice signal to be processed from the modified candidate base frequency points by adopting a dynamic programming algorithm;
wherein the modifying of the candidate base frequency points based on the unvoiced and voiced sound classes corresponding to the voice frames and the preset base frequency screening conditions comprises:
determining a fundamental frequency candidate interval according to the fundamental frequency screening condition;
and correcting the fundamental frequency candidate interval based on the voiced and unvoiced sound category of each voice frame to obtain a corrected fundamental frequency candidate interval, and correcting the candidate fundamental frequency point according to the corrected fundamental frequency candidate interval.
2. The method of claim 1, wherein the performing voiced and unvoiced classification on the speech frames to obtain an unvoiced and voiced classification corresponding to each of the speech frames comprises:
and inputting the extracted acoustic features of the voice signal to be processed into a trained unvoiced and voiced sound classification model to obtain unvoiced and voiced sound classification results corresponding to voice frames in the voice signal to be processed.
3. The method of claim 2, wherein the method further comprises:
and training to obtain the trained unvoiced and voiced sound classification model based on the sample voice signals marked with the unvoiced and voiced sound classification information of each voice frame.
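Claims 2 and 3 above describe classifying frames with a trained unvoiced/voiced model and training that model on sample signals whose frames carry unvoiced/voiced labels. A minimal sketch under stated assumptions, using log energy and zero-crossing rate as the per-frame acoustic features and scikit-learn logistic regression as the model, might look like the following; the claims do not fix a feature set or model family.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def frame_features(signal, frame_len=400, hop=160):
    """Per-frame acoustic features: log energy and zero-crossing rate
    (the feature choice is an assumption made for this sketch)."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        feats.append([log_energy, zcr])
    return np.array(feats)

def train_uv_classifier(sample_signals, frame_labels):
    """Train on sample speech signals whose frames are labelled 1 (voiced) or
    0 (unvoiced); frame_labels[i] must hold one label per frame of sample_signals[i]."""
    X = np.vstack([frame_features(s) for s in sample_signals])
    y = np.concatenate(frame_labels)
    return LogisticRegression(max_iter=1000).fit(X, y)

def classify_uv(model, signal):
    """Return one unvoiced/voiced decision per frame of the signal to be processed."""
    return model.predict(frame_features(signal)).astype(bool)
```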
4. The method according to claim 1, wherein the extracting the candidate base frequency points of each speech frame in the speech signal to be processed based on the acoustic features of the speech signal to be processed comprises:
down-sampling the voice signal to be processed;
calculating, based on the acoustic features, peak points of a cross-correlation function for the voice frames of the down-sampled voice signal to be processed, and determining, according to the peak points, candidate base frequency points corresponding to the down-sampled voice signal to be processed;
mapping the candidate base frequency points corresponding to the down-sampled voice signal to be processed back to the voice signal to be processed, to obtain the candidate base frequency points of each voice frame in the voice signal to be processed.
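Claim 4 above outlines the candidate extraction itself: down-sample the signal to be processed, locate peak points of a correlation function per frame, and map the resulting lags back to candidate fundamental frequencies. A rough sketch of that pipeline follows, with the decimation factor, frame length, lag range, and peak-picking rule as illustrative assumptions rather than claimed values.

```python
import numpy as np
from scipy.signal import decimate, find_peaks

def candidate_f0_points(signal, fs, decim=4, frame_len_ms=40, hop_ms=10,
                        f0_min=60.0, f0_max=400.0, max_candidates=4):
    """Per-frame candidate F0 values (Hz) from cross-correlation peaks
    computed on a down-sampled copy of the signal to be processed."""
    # Down-sample the signal to be processed (the factor is an assumption).
    ds = decimate(signal, decim)
    fs_ds = fs / decim

    frame_len = int(frame_len_ms * fs_ds / 1000)
    hop = int(hop_ms * fs_ds / 1000)
    lag_min = int(fs_ds / f0_max)
    lag_max = int(fs_ds / f0_min)

    candidates = []
    for start in range(0, len(ds) - frame_len + 1, hop):
        frame = ds[start:start + frame_len]
        frame = frame - np.mean(frame)
        # Normalised autocorrelation used as the cross-correlation function.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        ac = ac / (ac[0] + 1e-10)
        # Peak points of the correlation within the admissible lag range.
        peaks, props = find_peaks(ac[lag_min:lag_max + 1], height=0.1)
        peaks = peaks + lag_min
        # Keep the strongest peaks and convert lags to F0 in Hz; converting with
        # the down-sampled rate maps the candidates back to the original signal's
        # frequency axis (assumption about how the mapping is realised).
        order = np.argsort(props["peak_heights"])[::-1][:max_candidates]
        candidates.append(fs_ds / peaks[order] if len(peaks) else np.array([]))
    return candidates
```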
5. The method according to any one of claims 1-4, wherein the determining of the fundamental frequency candidate interval according to the fundamental frequency screening condition comprises:
determining a fundamental frequency candidate interval according to the distribution characteristics of the candidate fundamental frequency points of each voice frame; and
and the correcting of the candidate base frequency points according to the corrected base frequency candidate interval comprises:
and replacing the target candidate base frequency point which is not in the corrected base frequency candidate interval with other candidate base frequency points in the voice frame corresponding to the target candidate base frequency point to obtain the corrected candidate base frequency point.
6. A fundamental frequency extraction apparatus comprising:
the extraction unit is configured to extract candidate base frequency points of each voice frame in the voice signal to be processed based on the acoustic characteristics of the voice signal to be processed;
the classification unit is configured to perform unvoiced and voiced classification on the voice frames to obtain unvoiced and voiced sound classes corresponding to the voice frames;
the determining unit is configured to modify the candidate base frequency points based on the unvoiced/voiced sound classes corresponding to the voice frames and preset base frequency screening conditions, and determine a base frequency sequence of the voice signal to be processed from the modified candidate base frequency points by adopting a dynamic programming algorithm;
wherein the determining unit is configured to correct the candidate base frequency point as follows:
determining a fundamental frequency candidate interval according to the fundamental frequency screening condition;
and correcting the fundamental frequency candidate interval based on the voiced and unvoiced sound category of each voice frame to obtain a corrected fundamental frequency candidate interval, and correcting the candidate fundamental frequency point according to the corrected fundamental frequency candidate interval.
7. The apparatus of claim 6, wherein the classification unit is further configured to classify the speech frames as unvoiced and voiced, resulting in a voiced and unvoiced class for each of the speech frames, as follows:
and inputting the extracted acoustic features of the voice signal to be processed into a trained unvoiced and voiced sound classification model to obtain unvoiced and voiced sound classification results corresponding to voice frames in the voice signal to be processed.
8. The apparatus of claim 7, wherein the apparatus further comprises:
and a training unit, configured to obtain the trained unvoiced and voiced sound classification model by training based on sample voice signals labeled with the unvoiced and voiced sound classification information of each voice frame.
9. The apparatus according to claim 6, wherein the extracting unit is further configured to extract the candidate fundamental frequency point of each speech frame in the speech signal to be processed based on the acoustic features of the speech signal to be processed as follows:
down-sampling the voice signal to be processed;
calculating, based on the acoustic features, peak points of a cross-correlation function for the voice frames of the down-sampled voice signal to be processed, and determining, according to the peak points, candidate base frequency points corresponding to the down-sampled voice signal to be processed;
mapping the candidate base frequency points corresponding to the down-sampled voice signal to be processed back to the voice signal to be processed, to obtain the candidate base frequency points of each voice frame in the voice signal to be processed.
10. The apparatus according to any of claims 6-9, wherein the determining unit is further configured to:
determining a fundamental frequency candidate interval according to the distribution characteristics of the candidate fundamental frequency points of each voice frame;
and the determining unit is further configured to correct the candidate base frequency point as follows:
and replacing the target candidate base frequency point which is not in the corrected base frequency candidate interval with other candidate base frequency points in the voice frame corresponding to the target candidate base frequency point to obtain the corrected candidate base frequency point.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-5.
CN201811482074.8A 2018-12-05 2018-12-05 Fundamental frequency extraction method and device Active CN109346109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811482074.8A CN109346109B (en) 2018-12-05 2018-12-05 Fundamental frequency extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811482074.8A CN109346109B (en) 2018-12-05 2018-12-05 Fundamental frequency extraction method and device

Publications (2)

Publication Number Publication Date
CN109346109A CN109346109A (en) 2019-02-15
CN109346109B true CN109346109B (en) 2020-02-07

Family

ID=65320206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811482074.8A Active CN109346109B (en) 2018-12-05 2018-12-05 Fundamental frequency extraction method and device

Country Status (1)

Country Link
CN (1) CN109346109B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210317B (en) * 2019-05-07 2024-04-09 平安科技(深圳)有限公司 Method, apparatus and computer readable storage medium for detecting fundamental frequency
CN112086104B (en) * 2020-08-18 2022-04-29 珠海市杰理科技股份有限公司 Method and device for obtaining fundamental frequency of audio signal, electronic equipment and storage medium
CN113409802B (en) * 2020-10-29 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for enhancing voice signal
CN113689837B (en) * 2021-08-24 2023-08-29 北京百度网讯科技有限公司 Audio data processing method, device, equipment and storage medium
CN113763930B (en) * 2021-11-05 2022-03-11 深圳市倍轻松科技股份有限公司 Voice analysis method, device, electronic equipment and computer readable storage medium
CN113851114B (en) * 2021-11-26 2022-02-15 深圳市倍轻松科技股份有限公司 Method and device for determining fundamental frequency of voice signal
CN114387989B (en) * 2022-03-23 2022-07-01 北京汇金春华科技有限公司 Voice signal processing method, device, system and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1151490C (en) * 2000-09-13 2004-05-26 中国科学院自动化研究所 High-accuracy high-resolution base frequency extracting method for speech recognization
CN102054480B (en) * 2009-10-29 2012-05-30 北京理工大学 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
JP5530812B2 (en) * 2010-06-04 2014-06-25 ニュアンス コミュニケーションズ,インコーポレイテッド Audio signal processing system, audio signal processing method, and audio signal processing program for outputting audio feature quantity
CN107945809B (en) * 2017-05-02 2021-11-09 大连民族大学 Polyphonic music polyphonic hyperestimation method

Also Published As

Publication number Publication date
CN109346109A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN109346109B (en) Fundamental frequency extraction method and device
CN107767869B (en) Method and apparatus for providing voice service
CN109545192B (en) Method and apparatus for generating a model
JP5229234B2 (en) Non-speech segment detection method and non-speech segment detection apparatus
CN111312231B (en) Audio detection method and device, electronic equipment and readable storage medium
CN109545193B (en) Method and apparatus for generating a model
CN108039181B (en) Method and device for analyzing emotion information of sound signal
CN109920431B (en) Method and apparatus for outputting information
CN111653274B (en) Wake-up word recognition method, device and storage medium
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
CN111370030A (en) Voice emotion detection method and device, storage medium and electronic equipment
CN112489682A (en) Audio processing method and device, electronic equipment and storage medium
Narendra et al. Robust voicing detection and F0 estimation for HMM-based speech synthesis
CN110827853A (en) Voice feature information extraction method, terminal and readable storage medium
CN109087627A (en) Method and apparatus for generating information
CN108962226B (en) Method and apparatus for detecting end point of voice
CN110930975B (en) Method and device for outputting information
CN107910005B (en) Target service positioning method and device for interactive text
US10650803B2 (en) Mapping between speech signal and transcript
JP2015082036A (en) Acoustic-analysis-frame reliability calculation device, acoustic model adaptation device, voice recognition device, program therefor, and acoustic-analysis-frame reliability calculation method
CN112885379A (en) Customer service voice evaluation method, system, device and storage medium
US9484045B2 (en) System and method for automatic prediction of speech suitability for statistical modeling
CN111554270A (en) Training sample screening method and electronic equipment
CN115512698B (en) Speech semantic analysis method
CN113593523A (en) Speech detection method and device based on artificial intelligence and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant