CN116543751A - Voice feature extraction method and device, electronic equipment and storage medium

Info

Publication number: CN116543751A
Application number: CN202210089814.1A
Authority: CN (China)
Legal status: Pending
Prior art keywords: voice, processed, frame, speech, mel
Other languages: Chinese (zh)
Inventors: 高万林, 刘利涛, 肖宛昂, 廖春, 熊琪琳
Current assignee: China Agricultural University
Original assignee: China Agricultural University
Application filed by China Agricultural University; priority to CN202210089814.1A; publication of CN116543751A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique

Abstract

The invention provides a speech feature extraction method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: performing voice endpoint detection on a speech segment to be processed to obtain a start frame and an end frame of the segment; performing speech feature extraction on the segment to obtain mel-frequency cepstral coefficient feature parameters for the segment; and, based on the start frame and the end frame, intercepting those mel-frequency cepstral coefficient feature parameters to obtain the valid mel-frequency cepstral coefficient feature parameters of the segment. The speech feature extraction method provided by the invention removes invalid speech-signal feature parameters during feature parameter extraction, thereby improving the efficiency of speech signal recognition.

Description

Voice feature extraction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and apparatus for extracting speech features, an electronic device, and a storage medium.
Background
In speech signal recognition, the extraction of feature parameters is key to recognition accuracy; for speech signals, the validity of the extracted feature parameters plays an important role in recognition.
In the related art, mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficients, also called MFCCs) or linear predictive cepstral coefficients (linear predictive cepstral coefficients, also called LPCCs) are commonly used in speech signal recognition. However, the extracted feature parameters are often contaminated with invalid parameters, which reduces the efficiency of speech signal recognition.
Disclosure of Invention
The invention provides a speech feature extraction method and apparatus, an electronic device, and a storage medium, to overcome the defect in the prior art that invalid feature parameters contaminate the extracted parameters; by removing invalid speech-signal feature parameters during feature parameter extraction, the efficiency of speech signal recognition is improved.
The invention provides a voice characteristic extraction method, which comprises the following steps: performing voice endpoint detection on a voice fragment to be processed to obtain a start frame and an end frame of the voice fragment to be processed; extracting voice characteristics of the voice fragments to be processed to obtain Mel frequency cepstrum coefficient characteristic parameters related to the voice fragments to be processed; and based on the starting frame and the ending frame, intercepting the characteristic parameters of the mel-frequency cepstrum coefficient of the voice fragment to be processed to obtain the characteristic parameters of the effective mel-frequency cepstrum coefficient of the voice fragment to be processed.
According to the speech feature extraction method provided by the invention, performing voice endpoint detection on the speech segment to be processed to obtain a start frame and an end frame of the segment comprises: framing the speech segment to be processed to obtain a framed speech segment; performing short-time zero-crossing rate calculation on each frame of the framed speech segment to obtain the number of zero crossings of the speech signal waveform of each frame, and performing short-time energy calculation on each frame of the framed speech segment to obtain the sum of absolute values of the speech signal energy values of each frame; and performing voice endpoint detection on the speech segment to be processed based on the number of zero crossings of the speech signal waveform of each frame and the sum of absolute values of the speech signal energy values of each frame, to obtain the start frame and the end frame of the speech segment to be processed.
According to the speech feature extraction method provided by the invention, performing voice endpoint detection on the speech segment to be processed based on the number of zero crossings of the speech signal waveform of each frame and the sum of absolute values of the speech signal energy values of each frame, to obtain the start frame and the end frame of the segment, comprises: determining a short-time zero-crossing rate threshold, a short-time energy threshold, and a speech segment length threshold; determining, in time order, the start frame of the speech segment to be processed based on the sum of absolute values of the speech signal energy values of each frame in the framed speech segment and the short-time energy threshold, and determining, in time order, the speech segment length of the speech segment to be processed based on the number of zero crossings of the speech signal waveform of each frame in the framed speech segment and the short-time zero-crossing rate threshold; and determining the valid speech segment length of the speech segment to be processed based on the speech segment length and the speech segment length threshold, and taking the last frame within the valid speech segment length as the end frame of the speech segment to be processed.
According to the method for extracting voice features provided by the invention, before the framing processing is performed on the voice segment to be processed, the method further comprises the following steps: and carrying out normalization processing on the voice fragment to be processed, and taking the voice fragment to be processed after normalization processing as a final voice fragment to be processed.
According to the method for extracting voice features provided by the invention, after the voice fragments to be processed are subjected to framing processing to obtain the voice fragments after framing processing, the method further comprises the following steps: and carrying out first-order digital filtering processing on the voice fragments after framing processing, and taking the voice fragments after framing processing after first-order digital filtering processing as final voice fragments after framing processing.
According to the speech feature extraction method provided by the invention, performing speech feature extraction on the speech segment to be processed to obtain mel-frequency cepstral coefficient feature parameters for the segment comprises: framing the speech segment to be processed to obtain a framed speech segment; windowing each frame of the framed speech segment to obtain a windowed speech segment; processing the windowed speech segment based on a discrete Fourier transform to obtain a Fourier-transformed speech segment and the complex energy values corresponding to the Fourier-transformed speech segment; performing multi-order mel filtering on the complex energy values to obtain the mel values corresponding to the speech segment to be processed; and obtaining the mel-frequency cepstral coefficient feature parameters of the speech segment to be processed through a discrete cosine transform based on the mel values.
According to the speech feature extraction method provided by the invention, performing multi-order mel filtering on the complex energy values to obtain the mel value corresponding to the speech segment to be processed comprises: determining the function values of the 24th-order mel filter bank; obtaining the mel value of each frame in the speech segment to be processed based on the complex energy values and the function values of the 24th-order mel filter bank; and obtaining the mel value corresponding to the speech segment to be processed based on the mel value of each frame.
According to the speech feature extraction method provided by the invention, after the framing processing is performed on the speech segment to be processed, the method further comprises: performing first-order digital filtering on the framed speech segment, and taking the framed speech segment after first-order digital filtering as the final framed speech segment.
The invention also provides a voice characteristic extraction device, which comprises: the detection module is used for detecting the voice end point of the voice fragment to be processed to obtain a start frame and an end frame of the voice fragment to be processed; the extraction module is used for extracting voice characteristics of the voice fragments to be processed to obtain Mel frequency cepstrum coefficient characteristic parameters related to the voice fragments to be processed; and the processing module is used for intercepting the characteristic parameters of the mel frequency cepstrum coefficient of the voice fragment to be processed based on the starting frame and the ending frame to obtain the characteristic parameters of the effective mel frequency cepstrum coefficient of the voice fragment to be processed.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the speech feature extraction method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech feature extraction method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements the steps of a method of speech feature extraction as described in any of the above.
According to the speech feature extraction method and apparatus, the electronic device, and the storage medium provided by the invention, the mel-frequency cepstral coefficient feature parameters of the speech segment to be processed are intercepted using the obtained start frame and end frame of the segment, so that the valid mel-frequency cepstral coefficient feature parameters of the segment can be obtained. Invalid speech-signal feature parameters are thus removed during feature parameter extraction, improving the efficiency of speech signal recognition.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a speech feature extraction method according to the present invention;
FIG. 2 is a schematic flow chart of detecting a voice endpoint of a voice segment to be processed according to the present invention, so as to obtain a start frame and an end frame of the voice segment to be processed;
FIG. 3 is a schematic flow chart of framing an input speech signal according to the present invention;
FIG. 4 is a schematic flow chart of detecting voice endpoints of a voice segment to be processed based on the number of zero crossings of the voice signal waveform of each frame and the sum of absolute values of the voice signal energy values of each frame;
FIG. 5 is a schematic diagram of the structure of determining a start frame and an end frame provided by the present invention;
FIG. 6 is a flowchart of extracting features of a voice segment to be processed to obtain parameters related to the mel-frequency cepstrum coefficient of the voice segment to be processed according to the present invention;
FIG. 7 is a schematic diagram of a first order digital filtering process for a single frame speech signal according to the present invention;
FIG. 8 is a schematic diagram of a voice feature extraction device according to the present invention;
fig. 9 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the speech signal recognition process, applying endpoint detection can improve the speed and accuracy of the entire recognition system. Endpoint detection analyzes and judges the input speech signal using digital techniques to find its starting point and ending point. Accurately locating these two endpoints has a great impact on subsequent signal processing and applications: endpoint detection not only reduces processing time and data storage space, but also helps filter noise and eliminate silence segments.
The invention provides a voice feature extraction method, which can directly extract effective mel frequency cepstrum coefficient feature parameters of a voice segment from a continuously input voice signal by executing voice endpoint detection and voice feature extraction in parallel, thereby facilitating subsequent voice recognition.
The process of the speech feature extraction method provided by the present invention is described below with reference to the following embodiments.
Fig. 1 is a schematic flow chart of a speech feature extraction method provided by the present invention.
In an exemplary embodiment of the present invention, as shown in fig. 1, the speech feature extraction method may include steps 110 to 130, and each step will be described below.
In step 110, a voice endpoint detection is performed on the to-be-processed voice segment to obtain a start frame and an end frame for the to-be-processed voice segment.
In one embodiment, a segment of speech to be processed may be obtained, wherein the segment of speech to be processed may be a continuous segment of speech. In the application process, the voice endpoint detection can be performed on the voice segment to be processed to obtain effective starting frames and ending frames of the voice segment to be processed, so that invalid voice segment frames can be removed.
In step 120, the speech feature extraction is performed on the speech segment to be processed, so as to obtain mel-frequency cepstrum coefficient feature parameters related to the speech segment to be processed.
In one embodiment, the voice feature extraction may be performed on the voice segment to be processed, so as to obtain mel-frequency cepstrum coefficient feature parameters related to the voice segment to be processed.
It should be noted that the mel-frequency cepstral coefficient (Mel-scale Frequency Cepstral Coefficients, also called MFCC) is the most common speech feature in modern speech recognition. Compared with other parameters, MFCC feature parameters fully account for the auditory perception of the human ear and are currently the most widely used and most successful feature parameters; their analysis is based on modeling the auditory characteristics of the human ear. The relation between the perceived pitch of a sound and its frequency is not linear; the mel frequency scale better matches the auditory characteristics of the human ear, its values corresponding approximately to a logarithmic distribution of the actual frequency. The specific relation between mel frequency and actual frequency is expressed as: Mel(f) = 2595 · lg(1 + f/700), where f is the actual frequency in Hz. The critical bandwidth varies with frequency in step with the mel scale: below 1000 Hz the distribution is approximately linear, with a bandwidth of about 100 Hz, while above 1000 Hz it grows logarithmically. Similar to the division into critical bands, the speech spectrum can be divided by a series of triangular filters, the mel filter bank. The weighted sum of the signal amplitudes within each triangular filter's bandwidth is taken as that band-pass filter's output; the logarithm of all filter outputs is then taken, and a discrete cosine transform (DCT) yields the MFCC feature parameters.
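As an illustrative aid (not part of the patent text), the mel mapping and a triangular mel filter bank of the kind described above can be sketched in Python with numpy; the sample rate, FFT size, and filter count below are assumed example values.

    import numpy as np

    def hz_to_mel(f):
        # Mel(f) = 2595 * lg(1 + f / 700), f in Hz
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filter_bank(num_filters=24, n_fft=256, sample_rate=8000):
        # Center frequencies are spaced uniformly on the mel scale, which is
        # roughly linear below 1 kHz and logarithmic above it.
        mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                                 num_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
        bank = np.zeros((num_filters, n_fft // 2 + 1))
        for i in range(1, num_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, center):
                bank[i - 1, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):
                bank[i - 1, k] = (right - k) / max(right - center, 1)
        return bank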
In step 130, based on the start frame and the end frame, the mel-frequency cepstrum coefficient characteristic parameters of the to-be-processed voice segment are intercepted, so as to obtain the effective mel-frequency cepstrum coefficient characteristic parameters of the to-be-processed voice segment.
In one embodiment, the mel-frequency cepstrum coefficient characteristic parameter of the entire continuous speech segment to be processed may be truncated based on the determined valid start frame and end frame to obtain the valid mel-frequency cepstrum coefficient characteristic parameter for the speech segment to be processed.
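For illustration only (the variable names here are hypothetical, not from the patent): if mfcc_all_frames holds one MFCC vector per frame of the continuous segment, the interception step reduces to slicing with the detected frame indices.

    valid_mfcc = mfcc_all_frames[start_frame : end_frame + 1]  # keep only the valid frames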
In the speech feature extraction method provided by the invention, the mel-frequency cepstral coefficient feature parameters of the speech segment to be processed are intercepted using the obtained start frame and end frame of the segment, so that the valid mel-frequency cepstral coefficient feature parameters of the segment can be obtained. Invalid speech-signal feature parameters are thus removed during feature parameter extraction, improving the efficiency of speech signal recognition.
In order to further describe the speech feature extraction method provided by the present invention, the following description will be given with reference to the following embodiments.
Fig. 2 is a schematic flow chart of detecting a voice endpoint of a voice segment to be processed according to the present invention, so as to obtain a start frame and an end frame of the voice segment to be processed.
In an exemplary embodiment of the present invention, as shown in fig. 2, performing voice endpoint detection on a to-be-processed voice segment to obtain a start frame and an end frame about the to-be-processed voice segment may include steps 210 to 240, and each step will be described below.
In step 210, the speech segment to be processed is subjected to framing processing, so as to obtain a speech segment after framing processing.
In one embodiment, the speech segments to be processed may be subjected to framing, for example, overlapping segmentation, to obtain the speech segments after framing.
In one embodiment, a dual-port random access memory (Random Access Memory, also known as RAM) and state-machine control may be used for voice framing. In application, a read address and a write address are generated under the control of the read and write clocks respectively; the voice signal is written into the RAM, and data is read from the RAM location pointed to by the read address. Reads of the RAM are counted with a count1 signal: when count1 equals the frame length, count1 is cleared and count2 (the frame count signal) is incremented, while the read address is rolled back before reading continues. As can be seen from fig. 3, the backoff length lenack = (frame length - frame shift), and when count2 equals the maximum frame number, framing is complete.
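The same overlapping segmentation can be sketched in software. A minimal Python version, assuming numpy and illustrative frame parameters, mirrors the read-address backoff of (frame length - frame shift):

    import numpy as np

    def frame_signal(x, frame_len=256, frame_shift=128):
        # Overlapping segmentation: each frame starts frame_shift samples after
        # the previous one, i.e. the read position steps back by
        # (frame_len - frame_shift) samples relative to the end of a frame.
        num_frames = 1 + (len(x) - frame_len) // frame_shift  # assumes len(x) >= frame_len
        frames = np.zeros((num_frames, frame_len))
        for i in range(num_frames):
            frames[i] = x[i * frame_shift : i * frame_shift + frame_len]
        return frames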
In step 220, short-time zero-crossing rate calculation is performed on each frame in the voice segment after frame segmentation, so as to obtain the number of zero crossings of the voice signal waveform of each frame.
In step 230, short-time energy calculation is performed on each frame in the segmented speech segment to obtain the sum of absolute values of the energy values of the speech signal of each frame.
In one embodiment, the number of zero crossings of the speech signal waveform for each frame in the framed speech segment may be calculated, as well as the sum of the absolute values of the speech signal energy values for each frame in the framed speech segment.
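Both per-frame quantities can be written compactly; the sketch below (Python/numpy, reusing the frame_signal helper above) follows the definitions in this description rather than the hardware implementation.

    import numpy as np

    def short_time_zcr(frames):
        # Number of zero crossings: count sign changes of the waveform per frame.
        signs = np.sign(frames)
        return np.sum(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

    def short_time_energy(frames):
        # Short-time energy as used here: sum of absolute sample values per frame.
        return np.sum(np.abs(frames), axis=1)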
In step 240, voice endpoint detection is performed on the speech segment to be processed based on the number of zero crossings of the speech signal waveform of each frame and the sum of absolute values of the speech signal energy values of each frame, to obtain a start frame and an end frame for the speech segment to be processed.
By determining the effective start frame and the effective end frame of the voice segment to be processed according to the embodiment, a foundation can be laid for extracting the effective voice signal characteristic parameters of the voice segment to be processed, so that the voice signal recognition efficiency is improved.
In one embodiment, as shown in fig. 4, performing voice endpoint detection on the to-be-processed speech segment based on the number of zero crossings of the speech signal waveform of each frame and the sum of absolute values of the speech signal energy values of each frame, to obtain a start frame and an end frame for the to-be-processed speech segment, may include steps 410 to 440, which will be described below.
In step 410, a short zero crossing rate threshold, a short energy threshold, and a speech segment length threshold are determined.
In step 420, a starting frame of the speech segment to be processed is determined in time order based on the sum of the absolute values of the speech signal energy values of each frame in the speech segment after framing processing and the short-time energy threshold.
In step 430, the speech segment length of the speech segment to be processed is determined based on the number of zero crossings of the speech signal waveform for each frame in the speech segment after framing processing and the short zero crossing rate threshold in time order.
In step 440, the effective speech segment length of the speech segment to be processed is determined based on the speech segment length and the speech segment length threshold, and the corresponding last frame in the effective speech segment length is used as the ending frame of the speech segment to be processed.
In one embodiment, the following steps may be used to obtain the start frame and the end frame for the speech segment to be processed, where:
In step 1, two energy thresholds amp1 = min(Q1, max(amp)/4) and amp2 = min(Q2, max(amp)/8) are calculated, where Q1 and Q2 may be fixed values; this embodiment does not specifically limit them, and they may for example be 10 and 2, respectively. It can be understood that the energy thresholds amp1 and amp2 correspond to the short-time energy thresholds described above.
In step 2, the voice signal is input in time order and each input frame (of the framed speech segment) is judged: when the frame's amp (corresponding to the sum of absolute values of the speech signal energy values described above) satisfies amp > amp1, the frame number is recorded as the start frame X1, the silence frame counter is set to Silence = 0, and processing proceeds to step 3; otherwise, step 2 is repeated.
In step 3, the input frames are counted by count: when amp > amp2 and zcr (corresponding to the number of zero crossings of the speech signal waveform described above) > zcr2, count is incremented and step 3 is repeated; otherwise, Silence is incremented and processing proceeds to step 4. Here zcr2 corresponds to the short-time zero-crossing rate threshold described above; in one example, zcr2 may be determined by averaging the short-time zero-crossing rates of the first 5 frames.
In step 4, Silence is judged: when Silence < maxsilence, count is incremented and step 3 is repeated; otherwise, processing proceeds to step 5.
In step 5, count is judged: when count is smaller than the minimum speech segment length minlen (corresponding to the speech segment length threshold), the speech segment is considered too short and is regarded as noise; the start frame is then cleared, Silence and count are set to zero, and processing returns to step 2. Otherwise, processing proceeds to step 6.
In step 6, voice detection is finished: the end frame X2 is recorded, and the start frame X1 and end frame X2 are output respectively, completing endpoint detection.
In yet another example, as can be seen in connection with fig. 5, the endpoint detection process may be performed on the speech segment to be processed in time order. If the energy of an input frame is greater than the low energy threshold (e.g., amp2), it may be determined that the input frame is not a silence frame; if its energy is less than the low energy threshold, it is a silence frame. Further, if the energy of the input frame is greater than the high energy threshold (e.g., amp1), and the number of frames of the corresponding speech segment is determined to be greater than the frame-count threshold, the segment is not an invalid speech segment and the corresponding input frame may be taken as the start frame. The speech segment to be processed is processed sequentially in this way, and the last frame whose energy exceeds the low energy threshold and whose zero-crossing rate exceeds the zero-crossing-rate threshold is recorded as the end frame.
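Steps 1 to 6 can be reconstructed in software as follows; this is a hedged sketch (Python/numpy), where Q1, Q2, maxsilence, and minlen are example values and the frame-index bookkeeping is an assumption rather than the exact hardware behavior.

    import numpy as np

    def detect_endpoints(amp, zcr, Q1=10, Q2=2, maxsilence=8, minlen=15):
        # amp: per-frame sum of absolute sample values; zcr: per-frame zero crossings.
        amp1 = min(Q1, np.max(amp) / 4)   # high (start) energy threshold, step 1
        amp2 = min(Q2, np.max(amp) / 8)   # low (continuation) energy threshold
        zcr2 = np.mean(zcr[:5])           # zero-crossing threshold from the first 5 frames
        in_speech, start, silence, count = False, -1, 0, 0
        for n in range(len(amp)):
            if not in_speech:             # step 2: look for the start frame X1
                if amp[n] > amp1:
                    in_speech, start, silence, count = True, n, 0, 1
            elif amp[n] > amp2 and zcr[n] > zcr2:
                count += 1                # step 3: still inside the speech segment
                silence = 0
            else:
                silence += 1              # step 4: tolerate short silences
                if silence < maxsilence:
                    count += 1
                elif count < minlen:      # step 5: too short, treat as noise
                    in_speech, start, silence, count = False, -1, 0, 0
                else:                     # step 6: output X1 and the end frame X2
                    return start, start + count - 1
        return (start, start + count - 1) if in_speech and count >= minlen else (None, None)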
In an exemplary embodiment of the present invention, continuing to describe the foregoing embodiment, before the framing processing is performed on the to-be-processed speech segment, the speech feature extraction method may further include: and carrying out normalization processing on the voice fragments to be processed, and taking the voice fragments to be processed after normalization processing as final voice fragments to be processed. The normalization processing can facilitate the processing of the voice fragments to be processed, and the operation cost is reduced.
In an exemplary embodiment of the present invention, continuing to describe the foregoing embodiment as an example, after performing frame segmentation on the to-be-processed speech segment to obtain a speech segment after frame segmentation, the speech feature extraction method may further include: and carrying out first-order digital filtering processing on the frame-processed voice fragments, and taking the frame-processed voice fragments subjected to the first-order digital filtering processing as final frame-processed voice fragments.
In one embodiment, the frame-split speech segment may be processed according to formula (1) to obtain a frame-split speech segment after the first-order digital filtering process.
y(n)=x(n)-z*x(n-1) (1)
Where x (n) represents a current signal in the speech segment to be processed, x (n-1) represents a signal of the previous clock cycle stored in the register, z represents a preset constant, for example, z=0.9375, and y (n) represents a frame-processed speech segment after the first-order digital filtering process. The voice segment can be smoother through the embodiment, and a foundation is laid for accurately determining the starting frame and the ending frame of the voice segment to be processed.
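A minimal software sketch of formula (1), with the register replaced by an array shift and z = 0.9375 taken from the example above:

    import numpy as np

    def first_order_filter(x, z=0.9375):
        # y(n) = x(n) - z * x(n-1); 0.9375 = 15/16 is convenient in fixed point.
        x = np.asarray(x, dtype=float)
        y = np.empty_like(x)
        y[0] = x[0]  # no previous sample is stored for the first one
        y[1:] = x[1:] - z * x[:-1]
        return y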
In order to further describe the speech feature extraction method provided by the present invention, the following description will be given with reference to the following embodiments.
Fig. 6 is a schematic flow chart of extracting voice characteristics of a voice segment to be processed to obtain mel frequency cepstrum coefficient characteristic parameters related to the voice segment to be processed.
In an exemplary embodiment of the present invention, as shown in fig. 6, performing speech feature extraction on a to-be-processed speech segment to obtain mel-frequency cepstrum coefficient feature parameters about the to-be-processed speech segment may include steps 610 to 650, and each step will be described below.
In step 610, the speech segment to be processed is subjected to framing processing, so as to obtain a speech segment after framing processing.
In one embodiment, the speech segments to be processed may be subjected to framing, for example, overlapping segmentation, to obtain the speech segments after framing.
In step 620, each frame in the segmented speech segment is windowed to obtain a windowed speech segment.
In one embodiment, as can be seen in fig. 7, each frame of data in the framed speech segment may be multiplied by the Hamming window function values stored in the ROM to obtain the windowed speech segment. Windowing the framed speech segment makes the amplitude of each frame signal taper gradually to 0 at both ends, which facilitates the Fourier transform and reduces spectral leakage.
In step 630, the windowed speech segment is processed based on the discrete fourier transform to obtain a fourier transformed speech segment and an energy value complex corresponding to the fourier transformed speech segment.
In one embodiment, a discrete Fourier transform may be applied to each windowed single-frame time-domain signal to obtain the single-frame frequency-domain information of that signal. The single-frame frequency-domain information may include the Fourier-transformed speech segment and the complex energy values corresponding to the Fourier-transformed speech segment.
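The windowing and transform steps together can be sketched as follows (Python/numpy; np.fft.rfft stands in for the hardware DFT, an assumption of this illustration):

    import numpy as np

    def frame_spectrum(frame):
        # Hamming window, then DFT. The complex rfft output corresponds to the
        # "energy value complex number" (real and imaginary parts), and its
        # magnitude (square root of the sum of squares) is the
        # frequency-domain energy value used below.
        windowed = frame * np.hamming(len(frame))
        spectrum = np.fft.rfft(windowed)
        magnitude = np.sqrt(spectrum.real ** 2 + spectrum.imag ** 2)
        return spectrum, magnitude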
In step 640, the energy value complex numbers are subjected to multi-stage mel filtering to obtain mel values corresponding to the speech segments to be processed.
In one embodiment, the Fourier transform of the speech signal yields values for both the real and imaginary parts. First, the sum of squares of the real-part and imaginary-part data is computed using two multipliers and an adder; then a square-root IP core extracts the root to obtain the frequency-domain energy value; finally, a MEL filter performs multi-order mel filtering on the complex energy values to obtain the mel values corresponding to the speech segment to be processed.
The MEL filter is essentially a group of triangular filters whose coefficients are fixed in the system. A read-only memory is used to store the triangular filter coefficients and frequency points; the frequency-domain energy values are multiplied by the function values of the 24th-order mel filter bank pre-stored in the read-only memory, the products over the whole frame signal are accumulated, and finally 24 mel values are obtained per frame.
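Reusing the illustrative helpers sketched earlier (mel_filter_bank, frame_signal, first_order_filter, and frame_spectrum are assumptions of this illustration, not the patent's stored coefficients), the per-frame multiply-accumulate reduces to a matrix-vector product:

    import numpy as np

    x = np.random.randn(4000)                    # stand-in input samples
    frames = frame_signal(first_order_filter(x))
    bank = mel_filter_bank(num_filters=24, n_fft=256, sample_rate=8000)
    spectrum, magnitude = frame_spectrum(frames[0])
    mel_values = bank @ magnitude                # 24 mel values for frame 0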
In step 650, based on the mel values, the mel-frequency cepstral coefficient feature parameters for the speech segment to be processed are obtained through a discrete cosine transform.
In one embodiment, the mel values may be subjected to logarithm and discrete cosine transform operations to obtain the mel-frequency cepstral coefficient feature parameters for the speech segment to be processed. In an example, the input data may be passed through a natural-logarithm (ln) IP core, and the result multiplied by the cosine coefficients and accumulated. In application, because there are few cosine coefficient parameters, registers can be used to store them conveniently.
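That multiply-accumulate is a type-II DCT; a sketch in Python follows, where the 12-coefficient count and the small logarithm offset are assumptions of the illustration.

    import numpy as np

    def mfcc_from_mel(mel_values, n_coeffs=12):
        log_mel = np.log(mel_values + 1e-10)  # ln; the offset guards against log(0)
        M = len(log_mel)                      # 24 mel values per frame
        # Multiply by the cosine coefficients and accumulate (a type-II DCT).
        return np.array([
            sum(log_mel[m] * np.cos(np.pi * k * (m + 0.5) / M) for m in range(M))
            for k in range(n_coeffs)
        ])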
In yet another example, further, the result of the accumulation calculation may be intercepted by using the voice start frame and the voice end frame obtained before, so as to finally obtain the MFCC characteristic parameters of the effective voice segment of the voice signal.
The process of performing multi-order mel filtering on the complex energy values to obtain the mel values corresponding to the speech segment to be processed is described below with reference to the following embodiments.
In one embodiment, the multi-order mel filtering of the energy value complex number to obtain the mel value corresponding to the speech segment to be processed may be implemented by:
determining the function values of the 24th-order mel filter bank;
obtaining the mel value of each frame in the speech segment to be processed based on the complex energy values and the function values of the 24th-order mel filter bank;
obtaining the mel value corresponding to the speech segment to be processed based on the mel value of each frame.
In this embodiment, the function values of the 24th-order mel filter bank occupy little storage capacity, and determining the mel values corresponding to the speech segment to be processed through them reduces the amount of computation, thereby improving processing efficiency.
In an exemplary embodiment of the present invention, continuing to take the foregoing embodiment as an example, after framing the to-be-processed speech segment, the speech feature extraction method may further include: performing first-order digital filtering on the framed speech segment, and taking the framed speech segment after first-order digital filtering as the final framed speech segment.
In one embodiment, the frame-split speech segment may be processed according to formula (2) to obtain a frame-split speech segment after the first-order digital filtering process.
y(n)=x(n)-z*x(n-1) (2)
Where x (n) represents a current signal in the speech segment to be processed, x (n-1) represents a signal of the previous clock cycle stored in the register, z represents a preset constant, for example, z=0.9375, and y (n) represents a frame-processed speech segment after the first-order digital filtering process. The voice fragments can be smoother through the embodiment, and a foundation is laid for improving the quality of the characteristic parameters of the mel frequency cepstrum coefficient of the voice fragments to be processed.
As described above, in the speech feature extraction method provided by the invention, the mel-frequency cepstral coefficient feature parameters of the speech segment to be processed are intercepted using the obtained start frame and end frame of the segment, so that the valid mel-frequency cepstral coefficient feature parameters of the segment can be obtained. Invalid speech-signal feature parameters are thus removed during feature parameter extraction, improving the efficiency of speech signal recognition.
Based on the same conception, the invention also provides a voice characteristic extraction device.
The voice feature extraction device provided by the invention is described below, and the voice feature extraction device described below and the voice feature extraction method described above can be referred to correspondingly.
Fig. 8 is a schematic structural diagram of a voice feature extraction device provided by the invention.
In an exemplary embodiment of the present invention, as shown in fig. 8, the voice feature extraction apparatus may include a detection module 810, an extraction module 820, and a processing module 830, which will be described below separately.
The detection module 810 may be configured to perform voice endpoint detection on the to-be-processed voice segment, resulting in a start frame and an end frame for the to-be-processed voice segment.
The extraction module 820 may be configured to perform speech feature extraction on the speech segment to be processed, resulting in mel-frequency cepstrum coefficient feature parameters for the speech segment to be processed.
The processing module 830 may be configured to intercept the mel-frequency cepstral coefficient feature parameter of the speech segment to be processed based on the start frame and the end frame to obtain a valid mel-frequency cepstral coefficient feature parameter for the speech segment to be processed.
In an exemplary embodiment of the present invention, the detection module 810 may perform voice endpoint detection on the to-be-processed speech segment in the following manner to obtain a start frame and an end frame of the segment: framing the speech segment to be processed to obtain a framed speech segment; performing short-time zero-crossing rate calculation on each frame of the framed speech segment to obtain the number of zero crossings of the speech signal waveform of each frame, and performing short-time energy calculation on each frame of the framed speech segment to obtain the sum of absolute values of the speech signal energy values of each frame; and performing voice endpoint detection on the speech segment to be processed based on the number of zero crossings of the speech signal waveform of each frame and the sum of absolute values of the speech signal energy values of each frame, to obtain the start frame and the end frame of the segment.
In an exemplary embodiment of the present invention, the detection module 810 may perform voice endpoint detection on the speech segment to be processed based on the number of zero crossings of the speech signal waveform of each frame and the sum of absolute values of the speech signal energy values of each frame, to obtain the start frame and the end frame of the segment, in the following manner: determining a short-time zero-crossing rate threshold, a short-time energy threshold, and a speech segment length threshold; determining, in time order, the start frame of the speech segment to be processed based on the sum of absolute values of the speech signal energy values of each frame in the framed speech segment and the short-time energy threshold, and determining, in time order, the speech segment length of the segment based on the number of zero crossings of the speech signal waveform of each frame in the framed speech segment and the short-time zero-crossing rate threshold; and determining the valid speech segment length of the segment based on the speech segment length and the speech segment length threshold, and taking the last frame within the valid speech segment length as the end frame of the segment.
In an exemplary embodiment of the present invention, the detection module 810 may be further configured to normalize the to-be-processed speech segment, and take the normalized to-be-processed speech segment as a final to-be-processed speech segment.
In an exemplary embodiment of the present invention, the detection module 810 may be further configured to perform a first-order digital filtering process on the frame-processed speech segment, and use the frame-processed speech segment after the first-order digital filtering process as a final frame-processed speech segment.
In an exemplary embodiment of the present invention, the extraction module 820 may perform speech feature extraction on the to-be-processed speech segment to obtain mel-frequency cepstral coefficient feature parameters for the segment in the following manner: framing the speech segment to be processed to obtain a framed speech segment; windowing each frame of the framed speech segment to obtain a windowed speech segment; processing the windowed speech segment based on a discrete Fourier transform to obtain a Fourier-transformed speech segment and the complex energy values corresponding to it; performing multi-order mel filtering on the complex energy values to obtain the mel values corresponding to the speech segment to be processed; and, based on the mel values, obtaining the mel-frequency cepstral coefficient feature parameters for the speech segment to be processed through a discrete cosine transform.
In an exemplary embodiment of the present invention, the extraction module 820 may perform multi-order mel filtering on the complex energy values to obtain the mel values corresponding to the speech segment to be processed in the following manner: determining the function values of the 24th-order mel filter bank; obtaining the mel value of each frame in the speech segment to be processed based on the complex energy values and the function values of the 24th-order mel filter bank; and obtaining the mel value corresponding to the speech segment to be processed based on the mel value of each frame.
In an exemplary embodiment of the present invention, the extraction module 820 may be further configured to perform a first-order digital filtering process on the frame-processed speech segment, and use the frame-processed speech segment after the first-order digital filtering process as a final frame-processed speech segment.
Fig. 9 illustrates a physical schematic diagram of an electronic device. As shown in fig. 9, the electronic device may include: a processor 910, a communication interface (Communications Interface) 920, a memory 930, and a communication bus 940, wherein the processor 910, the communication interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 can invoke logic instructions in the memory 930 to perform a speech feature extraction method comprising: performing voice endpoint detection on the speech segment to be processed to obtain a start frame and an end frame of the segment; performing speech feature extraction on the segment to obtain mel-frequency cepstral coefficient feature parameters of the segment; and, based on the start frame and the end frame, intercepting the mel-frequency cepstral coefficient feature parameters of the segment to obtain the valid mel-frequency cepstral coefficient feature parameters of the segment.
Further, the logic instructions in the memory 930 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the speech feature extraction method provided by the above methods, the method comprising: performing voice endpoint detection on the voice fragment to be processed to obtain a start frame and an end frame of the voice fragment to be processed; extracting voice characteristics of the voice fragments to be processed to obtain Mel frequency cepstrum coefficient characteristic parameters of the voice fragments to be processed; based on the starting frame and the ending frame, intercepting the characteristic parameters of the mel frequency cepstrum coefficient of the voice fragment to be processed to obtain the characteristic parameters of the effective mel frequency cepstrum coefficient of the voice fragment to be processed.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the speech feature extraction method provided by the above methods, the method comprising: performing voice endpoint detection on the voice fragment to be processed to obtain a start frame and an end frame of the voice fragment to be processed; extracting voice characteristics of the voice fragments to be processed to obtain Mel frequency cepstrum coefficient characteristic parameters of the voice fragments to be processed; based on the starting frame and the ending frame, intercepting the characteristic parameters of the mel frequency cepstrum coefficient of the voice fragment to be processed to obtain the characteristic parameters of the effective mel frequency cepstrum coefficient of the voice fragment to be processed.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It will be further understood that although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of extracting speech features, the method comprising:
performing voice endpoint detection on a voice fragment to be processed to obtain a start frame and an end frame of the voice fragment to be processed;
extracting voice characteristics of the voice fragments to be processed to obtain Mel frequency cepstrum coefficient characteristic parameters related to the voice fragments to be processed;
and based on the starting frame and the ending frame, intercepting the characteristic parameters of the mel-frequency cepstrum coefficient of the voice fragment to be processed to obtain the characteristic parameters of the effective mel-frequency cepstrum coefficient of the voice fragment to be processed.
2. The method for extracting speech features according to claim 1, wherein the performing speech endpoint detection on the to-be-processed speech segment to obtain a start frame and an end frame for the to-be-processed speech segment includes:
carrying out framing treatment on the voice fragments to be treated to obtain voice fragments after framing treatment;
calculating the short-time zero crossing rate of each frame in the voice fragments after the framing treatment to obtain the zero crossing times of the voice signal waveform of each frame, and
short-time energy calculation is carried out on each frame in the voice fragments after the framing treatment, and the sum of absolute values of voice signal energy values of each frame is obtained;
and performing voice endpoint detection on the voice fragments to be processed based on the number of zero crossings of the voice signal waveform of each frame and the sum of absolute values of the voice signal energy values of each frame, to obtain a start frame and an end frame of the voice fragments to be processed.
3. The method for extracting speech features according to claim 2, wherein said performing voice endpoint detection on the speech segment to be processed based on the number of zero crossings of the speech signal waveform of each frame and the sum of absolute values of the speech signal energy values of each frame to obtain a start frame and an end frame for the speech segment to be processed comprises:
determining a short-time zero crossing rate threshold, a short-time energy threshold and a voice fragment length threshold;
determining a start frame of the speech segment to be processed based on a sum of absolute values of the speech signal energy values of each frame of the speech segment after the framing process and the short-time energy threshold in time order, and
according to the time sequence, determining the voice segment length of the voice segment to be processed based on the zero crossing times of the voice signal waveform of each frame in the voice segment after the framing processing and the short-time zero crossing rate threshold;
And determining the effective voice fragment length of the voice fragment to be processed based on the voice fragment length and the voice fragment length threshold, and taking the corresponding last frame in the effective voice fragment length as the ending frame of the voice fragment to be processed.
4. The speech feature extraction method according to claim 2, characterized in that before the framing processing of the speech segment to be processed, the method further comprises:
and carrying out normalization processing on the voice fragment to be processed, and taking the voice fragment to be processed after normalization processing as a final voice fragment to be processed.
5. The speech feature extraction method according to claim 2, wherein after the framing processing of the speech segment to be processed to obtain the framed speech segment, the method further comprises:
performing first-order digital filtering processing on the framed speech segment, and taking the filtered framed speech segment as the final framed speech segment.
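A first-order digital filter in this position is conventionally a pre-emphasis filter, y[n] = x[n] - a * x[n-1]. The sketch below applies it to each frame row; the coefficient 0.97 is a customary choice, not a value taken from the patent.

```python
import numpy as np

def pre_emphasis(frames: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order digital filter y[n] = x[n] - alpha * x[n-1], per frame.
    alpha = 0.97 is a conventional assumption, not specified in the claims."""
    out = frames.astype(float).copy()
    out[:, 1:] = frames[:, 1:] - alpha * frames[:, :-1]
    return out
```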
6. The speech feature extraction method according to claim 1, wherein performing speech feature extraction on the speech segment to be processed to obtain the Mel-frequency cepstral coefficient feature parameters of the speech segment to be processed comprises:
performing framing processing on the speech segment to be processed to obtain a framed speech segment;
performing windowing processing on each frame of the framed speech segment to obtain a windowed speech segment;
processing the windowed speech segment based on a discrete Fourier transform to obtain a Fourier-transformed speech segment and complex energy values corresponding to the Fourier-transformed speech segment;
performing multi-order Mel filtering on the complex energy values to obtain Mel values corresponding to the speech segment to be processed; and
obtaining the Mel-frequency cepstral coefficient feature parameters of the speech segment to be processed from the Mel values through a discrete cosine transform.
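The five steps of claim 6 map onto a conventional MFCC front end. The sketch below assumes a Hamming window and 13 cepstral coefficients (neither is specified in the claim), reuses `frame_signal` from the claim 2 sketch, and relies on the `mel_filterbank` helper sketched after claim 7 below.

```python
import numpy as np

def extract_mfcc(x, sr, frame_len=256, hop=128, n_filters=24, n_ceps=13):
    """Claim 6 sketch: framing, windowing, DFT, Mel filtering, then DCT."""
    frames = frame_signal(x, frame_len, hop)         # framing step
    frames = frames * np.hamming(frame_len)          # windowing (assumed Hamming)
    spectrum = np.fft.rfft(frames, n=frame_len)      # complex DFT values
    power = np.abs(spectrum) ** 2                    # energy of the complex values
    mel_energy = power @ mel_filterbank(n_filters, frame_len, sr).T
    log_mel = np.log(np.maximum(mel_energy, 1e-10))  # floor avoids log(0)
    # Discrete cosine transform (type II), written out directly with NumPy.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_mel @ dct.T
```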
7. The speech feature extraction method according to claim 6, wherein performing multi-order Mel filtering on the complex energy values to obtain the Mel values corresponding to the speech segment to be processed comprises:
determining function values of a twenty-fourth-order Mel filter bank;
obtaining a Mel value for each frame of the speech segment to be processed based on the complex energy values and the function values of the twenty-fourth-order Mel filter bank; and
obtaining the Mel values corresponding to the speech segment to be processed based on the Mel value of each frame.
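Claim 7 fixes the filter bank at twenty-four filters. One standard construction of a triangular Mel filter bank is sketched below; the triangular shape, the Mel-scale formula, and the FFT size and sample rate defaults are conventional assumptions, as the claim only recites the order.

```python
import numpy as np

def mel_filterbank(n_filters: int = 24, n_fft: int = 256, sr: int = 8000) -> np.ndarray:
    """Triangular Mel filter bank; 24 filters as recited in claim 7."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter corner frequencies, equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):            # rising slope
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):           # falling slope
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb
```

Multiplying each frame's energy spectrum by this matrix gives the per-frame Mel values, and stacking those over all frames yields the Mel values for the whole segment, as the last two steps of the claim describe.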
8. The speech feature extraction method according to claim 6, wherein before the framing processing of the speech segment to be processed, the method further comprises:
performing first-order digital filtering processing on the speech segment to be processed, and taking the filtered speech segment as the final speech segment to be processed.
9. A speech feature extraction apparatus, the apparatus comprising:
a detection module configured to perform speech endpoint detection on a speech segment to be processed to obtain a start frame and an end frame of the speech segment to be processed;
an extraction module configured to perform speech feature extraction on the speech segment to be processed to obtain Mel-frequency cepstral coefficient feature parameters of the speech segment to be processed; and
a processing module configured to truncate the Mel-frequency cepstral coefficient feature parameters of the speech segment to be processed based on the start frame and the end frame to obtain effective Mel-frequency cepstral coefficient feature parameters of the speech segment to be processed.
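Claim 9 restates the method of claim 1 as three modules. A hypothetical arrangement, delegating to the sketches above, might look like this:

```python
class SpeechFeatureExtractor:
    """Hypothetical module layout for the apparatus of claim 9."""
    def __init__(self, sr: int):
        self.sr = sr

    def detect(self, x):   # detection module (endpoint detection)
        return detect_endpoints(x, self.sr)

    def extract(self, x):  # extraction module (MFCC feature parameters)
        return extract_mfcc(x, self.sr)

    def process(self, x):  # processing module (truncation to effective MFCCs)
        start, end = self.detect(x)
        return self.extract(x)[start:end + 1]
```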
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the speech feature extraction method according to any one of claims 1 to 8.
11. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech feature extraction method according to any one of claims 1 to 8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the speech feature extraction method according to any one of claims 1 to 8.
CN202210089814.1A 2022-01-25 2022-01-25 Voice feature extraction method and device, electronic equipment and storage medium Pending CN116543751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210089814.1A CN116543751A (en) 2022-01-25 2022-01-25 Voice feature extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210089814.1A CN116543751A (en) 2022-01-25 2022-01-25 Voice feature extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116543751A (en) 2023-08-04

Family

ID=87449314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210089814.1A Pending CN116543751A (en) 2022-01-25 2022-01-25 Voice feature extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116543751A (en)

Similar Documents

Publication Publication Date Title
CN106935248B (en) Voice similarity detection method and device
EP3016314B1 (en) A system and a method for detecting recorded biometric information
WO2019227590A1 (en) Voice enhancement method, apparatus, computer device, and storage medium
CN103903612B (en) Method for performing real-time digital speech recognition
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
JPH08508107A (en) Method and apparatus for speaker recognition
Kesarkar et al. Feature extraction for speech recognition
CN108682432B (en) Speech emotion recognition device
CN110060665A (en) Word speed detection method and device, readable storage medium storing program for executing
CN111540342B (en) Energy threshold adjusting method, device, equipment and medium
CN108847253B (en) Vehicle model identification method, device, computer equipment and storage medium
CN112599148A (en) Voice recognition method and device
EP4189677B1 (en) Noise reduction using machine learning
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
CN112542174A (en) VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN111863008A (en) Audio noise reduction method and device and storage medium
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
CN108682433A (en) The heart sound kind identification method of first-order difference coefficient based on MFCC
CN112489692B (en) Voice endpoint detection method and device
CN105355206B (en) Voiceprint feature extraction method and electronic equipment
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium
CN111341327A (en) Speaker voice recognition method, device and equipment based on particle swarm optimization
CN110875037A (en) Voice data processing method and device and electronic equipment
CN116543751A (en) Voice feature extraction method and device, electronic equipment and storage medium
CN116364107A (en) Voice signal detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination