CN111768800A - Voice signal processing method, apparatus and storage medium

Voice signal processing method, apparatus and storage medium

Info

Publication number
CN111768800A
Authority
CN
China
Prior art keywords: frame, detected, mute, voice, threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010581908.1A
Other languages
Chinese (zh)
Inventor
曹刚 (Cao Gang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN202010581908.1A
Publication of CN111768800A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/05: Word boundary detection
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L 25/78: Detection of presence or absence of voice signals
    • G10L 2025/783: Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephone Function (AREA)

Abstract

Embodiments of the present application relate to a voice signal processing method, device, and storage medium. The method includes the following steps: acquiring the audio features of a frame to be detected; obtaining the mute point ratio within a preset-length time window before the frame to be detected in the voice signal; determining a mute point ratio threshold according to the audio features; and judging whether the frame to be detected is a tail point frame according to the mute point ratio and the mute point ratio threshold. By using the mute point ratio within the preset-length time window and dynamically adjusting the mute point ratio threshold with the cepstral features of the current frame to be detected, the embodiments overcome the inaccurate voice tail point detection caused by a fixed mute point ratio threshold and effectively improve the accuracy and real-time performance of tail point frame detection.

Description

Voice signal processing method, apparatus and storage medium
Technical Field
The embodiments of the present application relate to, but are not limited to, the field of computer technologies, and in particular, to a method, a device, and a storage medium for processing a voice signal.
Background
With the development of artificial intelligence, speech recognition has become a standard feature of many devices. Speech recognition takes speech as its research object: through speech signal processing and pattern recognition, a machine automatically recognizes and understands human spoken language.
Voice tail point detection plays a key role in speech recognition: it finds the tail point of the speech within the audio data, and its accuracy is crucial to the accuracy of speech recognition.
At present, voice tail point detection suffers from the difficulty of determining the voice tail point, which greatly reduces the accuracy of speech recognition.
Disclosure of Invention
The embodiment of the application provides a voice signal processing method, a device and a storage medium, which can improve the accuracy of voice tail point detection and recognition.
In a first aspect, an embodiment of the present application provides a speech signal processing method, including: acquiring audio features of a frame to be detected in a voice signal; obtaining the ratio of mute points in a time window with a preset length before a frame to be detected; obtaining a mute point ratio threshold according to the audio features; and determining a tail point frame in the voice signal according to the mute point ratio and the mute point ratio threshold.
In a second aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speech signal processing method of the first aspect.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for executing the speech signal processing method described in the first aspect.
The embodiments of the application include the following steps: acquiring the audio features of a frame to be detected in a voice signal; obtaining the mute point ratio within a preset-length time window before the frame to be detected; determining a mute point ratio threshold according to the audio features; and determining that the frame to be detected is a tail point frame according to the mute point ratio and the mute point ratio threshold. By using the mute point ratio within the preset-length time window and dynamically adjusting the mute point ratio threshold with the cepstral features of the current frame to be detected, the embodiments overcome the inaccurate voice tail point detection caused by a fixed mute point ratio threshold and effectively improve the accuracy and real-time performance of tail point frame detection.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1 is a flowchart of a speech signal processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a speech signal processing method according to another embodiment of the present application;
FIG. 3 is a flow chart of a method for processing a speech signal according to another embodiment of the present application;
FIG. 4 is a flow chart of a speech signal processing method according to another embodiment of the present application;
FIG. 5 is a flow chart of a method for processing a speech signal according to another embodiment of the present application;
FIG. 6 is a flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic diagram of a speech processing apparatus according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Voice tail point detection is an important link in speech recognition and one of the most important steps in voice signal processing; its accuracy directly influences the speed and the result of voice signal processing.
Voice signal processing methods in the related art often suffer from insufficiently accurate voice tail point detection and an inability to adjust the detection adaptively. For example, in the related art, voice tail point detection is easily affected by noise and therefore not accurate enough; moreover, because each person speaks at a different speed, the related-art detection method is prone to misidentifying the tail point frame or recognizing it too slowly.
Based on this, the embodiments of the present application provide a speech signal processing method, device, and storage medium, which use the mute point ratio within a preset-length time window and dynamically adjust the mute point ratio threshold with the cepstral features of the current frame to be detected, thereby overcoming the inaccurate voice tail point detection caused by a fixed mute point ratio threshold and effectively improving the accuracy and real-time performance of tail point frame detection.
It should be noted that, in the following embodiments, the terminal/device may be a mobile terminal device or a non-mobile terminal device. The mobile terminal equipment can be a mobile phone, a tablet computer, a notebook computer, a palm computer, vehicle-mounted terminal equipment, wearable equipment, a super mobile personal computer, a netbook, a personal digital assistant and the like; the non-mobile terminal equipment can be a personal computer, a television, a teller machine or a self-service machine and the like; the embodiments of the present application are not particularly limited.
In some embodiments, the electronic device may include a processor, an external memory interface, an internal memory, a Universal Serial Bus (USB) interface, a charging management module, a power management module, a battery, an antenna, a mobile communication module, a wireless communication module, an audio module, a speaker, a microphone, an earphone interface, a sensor module, a button, a motor, an indicator, a camera, a display screen, a Subscriber Identification Module (SIM) card interface, and the like. The sensor module may include a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
In some embodiments, the electronic device may implement audio functions via an audio module, a speaker, a microphone, a headset interface, an application processor, and/or the like. Such as music playing, recording, etc.
In some embodiments, the audio module is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module may also be used to encode and decode audio signals.
In some embodiments, the audio module may be disposed in the processor, or a portion of its functional modules may be disposed in the processor. A speaker, also known as a "horn", is used to convert an audio electrical signal into a sound signal; the electronic device can play music or take a hands-free call through the speaker. A receiver, also called a "handset", is used to convert an audio electrical signal into a voice signal; when the electronic device answers a call or plays voice information, the voice can be heard by placing the receiver close to the ear. A microphone, also known as a "mike", is used to convert a voice signal into an electrical signal; when making a call or sending voice information, the user can input a voice signal by speaking close to the microphone. The electronic device may be provided with at least one microphone.
In some embodiments, the electronic device may be provided with two microphones to achieve a noise reduction function in addition to collecting voice signals.
In some embodiments, the electronic device may further include three, four, or more microphones to collect a voice signal and reduce noise, and may further identify a sound source and implement a directional recording function.
In a first aspect, an embodiment of the present application provides a speech signal processing method for an electronic device.
In some embodiments, referring to fig. 1, a voice signal processing method may include:
step S1100, acquiring audio characteristics of a frame to be detected in a voice signal;
step S1200, obtaining the ratio of mute points in a time window with a preset length before a frame to be detected;
step S1300, obtaining a mute point ratio threshold according to the audio characteristics;
and step S1400, determining a tail point frame in the voice signal according to the mute point ratio and the mute point ratio threshold.
It should be noted that S1100, S1200, S1300, and S1400 only represent reference numerals, and should not be construed as limiting the order of steps. In particular, the sequence of the steps S1100 and S1200 is not limited.
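To make the relationship between these steps concrete, the following is a minimal Python sketch of one per-frame check. It assumes the combined threshold formula R = (b * t0) / (a * C0) developed later in the description; the constants a and b and all variable names are illustrative, not part of the claims.

```python
def is_tail_point(c0, t0, mute_flags, T=70, a=0.5, b=0.1):
    """One per-frame tail point check (steps S1100-S1400).

    c0, t0     -- zero-dimensional cepstrum and pitch-period peak position
                  of the frame to be detected (step S1100)
    mute_flags -- 1/0 mute decisions for the T frames before it (step S1200)
    """
    r = sum(mute_flags[-T:]) / T      # mute point ratio within the window
    R = (b * t0) / (a * c0)           # step S1300: dynamic threshold
    return r > R                      # step S1400: tail point decision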
In some embodiments, the frame to be detected in step S1100 is a speech frame awaiting detection; voice tail point detection determines whether this speech frame is a tail point frame.
In some embodiments, the audio features may be time domain features or frequency domain features, both of which can be extracted from a short segment of audio. Time domain features are extracted directly from the original speech signal, while frequency domain features are obtained by applying a Fourier transform to the original speech signal, converting it into the frequency domain, and extracting features there. In the time domain, the ordinate of each sampling point represents the energy amplitude of that point; in the frequency domain, the ordinate of each point represents the energy level of its corresponding band within a short time frame.
In some embodiments, the audio features are cepstral features, which belong to the frequency domain. It can be understood that a speech signal is inherently non-stationary, i.e. it changes dramatically over short intervals such as 10 ms, yet it is stable over the short term, so the sampling points are usually processed per short segment, or audio frame. An audio signal is composed of different energies carried at different frequencies, and more of the information in the speech signal can be obtained from its spectrum. The speech signal can be divided into a number of speech frames; each frame corresponds to a spectrum, calculated by a short-time fast Fourier transform, that represents the relationship between the frequency and the energy of the speech signal. The peaks in the spectrum represent the main frequency components of the speech and are also called formants; a formant carries a speech identification attribute and can be used to distinguish different voices. The cepstral analysis applied to the speech frame to be detected is described below.
In some embodiments, the generation of the cepstrum is a homomorphic signal process whose aim is to transform a nonlinear problem into a linear one. The original speech signal is in effect a convolutional signal. In the first step it is transformed into a multiplicative signal, since a convolution in the time domain is equivalent to a multiplication in the frequency domain; in the second step the multiplicative signal is transformed into an additive signal by taking the logarithm; in the third step an inverse Fourier transform restores it to a convolution-like signal. Although the sequences before and after are both time-domain sequences, their discrete time domains differ, so the domain of the output signal is called the cepstral domain.
Wherein the cepstral function can be expressed as:
C(q) = |IF(log(S(f)))|^2

where q is the quefrency (the independent variable of the cepstrum), f is the frequency, S(f) is the Fourier transform of the time-domain signal s(t), log() is the logarithm, and IF denotes the inverse Fourier transform.
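As an illustration only, the cepstral function above can be computed with a few lines of numpy. The Hamming window, the small epsilon guarding log(0), and the pitch search range below are assumptions of this sketch, not values given by the application.

```python
import numpy as np

def cepstrum(frame):
    """C(q) = |IF(log S(f))|^2 for one speech frame."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_spectrum = np.log(np.abs(spectrum) + 1e-12)   # guard against log(0)
    return np.abs(np.fft.irfft(log_spectrum)) ** 2

# Example: a 25 ms frame at 16 kHz (400 samples) of a 200 Hz tone.
fs = 16000
frame = np.sin(2 * np.pi * 200 * np.arange(400) / fs)
c = cepstrum(frame)
c0 = c[0]                         # zero-dimensional cepstrum C0
t0 = 32 + np.argmax(c[32:200])    # pitch-period peak position t0,
                                  # searched over roughly 80-500 Hz
```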
It can be understood that, on one hand, the cepstrum can effectively remove noise interference, and its separating property makes periodic signals easier to detect. During the logarithmic conversion of the power spectrum, the cepstrum gives a higher weight to low-amplitude components and a smaller weight to high-amplitude components; this weighting helps highlight small periodic signals. The speech volume near the voice tail point is generally low, so the speech amplitude is correspondingly small; the amplitude weighting therefore effectively strengthens the periodic signal of the tail point frame, and the cepstral features of the speech frame both remove noise effectively and stand out more clearly.
On the other hand, the cepstral features can depict the user's instantaneous speech rate in the frame to be detected. Two important parameters are the zero-dimensional cepstrum C0 (the C0 value of the cepstral features) and the peak position t0 of the pitch period in the cepstrum. The cepstral peak position t0 of slow speech lies behind that of fast speech, so the pitch period of slow speech is larger than that of fast speech; conversely, the peak position t0 of fast speech lies ahead of that of slow speech, so the fast pitch period is smaller. Meanwhile, the cepstrum has an obvious peak at the position corresponding to the pitch period, which makes the peak feature more prominent when applied to tail point frame detection; finally, combined with the zero-dimensional cepstrum C0 of the cepstral features, an unambiguous mute frame threshold can be obtained.
In some embodiments, the preset-length time window in step S1200 is a fixed total number of frames immediately preceding the frame to be detected; in this embodiment T denotes the preset-length time window. For example, if T is 70, the window holds a total of 70 speech frames, each of which may be a mute frame or a non-mute frame; assuming the number of mute frames within the window is n, the mute point ratio is n/T.
It will be appreciated that the value of T is fixed. For example, assume the speech stream start frame is S0 and the frame to be detected is the speech frame S1 that follows the preset-length time window T. When the frame to be detected moves back by one speech frame, the total number of speech frames in the window T does not change, and the starting point of the window T correspondingly moves back by one speech frame.
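Such a sliding window of mute flags behaves exactly like a fixed-size queue; a minimal sketch, with T = 70 taken from the example above:

```python
from collections import deque

T = 70                           # preset-length time window, in frames
window = deque(maxlen=T)         # mute flags of the last T speech frames

def mute_point_ratio(frame_is_mute):
    """Push the newest frame's mute flag and return r = n / T.

    Appending to a full deque drops the oldest flag, which is the
    window start moving back by one speech frame.
    """
    window.append(1 if frame_is_mute else 0)
    return sum(window) / T if len(window) == T else 0.0
```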
In some embodiments, the method for processing a speech signal may further include a method for determining a speech stream start frame using an energy value of speech, and referring to fig. 2, before S1100, the method may further include:
step S1500, setting a voice energy threshold Elow;
step S1600, acquiring M0 consecutive voice frames whose energy is lower than the voice energy threshold Elow;
step S1700, acquiring a preset number M0 of consecutive voice frames whose voice energy is higher than the voice energy threshold Elow;
step S1800, acquiring the speech frame where the speech energy value starts to increase, and regarding that frame as the speech stream start frame S0.
It can be understood that energy is the carrier of a signal and can be used to judge the presence of a speech signal. A completely noise-free environment hardly exists, however, so the criterion for the beginning of speech is not an energy threshold of 0; instead a voice energy threshold Elow is set, obtained through statistical experiments or as an empirical value. When the energy of M0 consecutive speech frames is below this threshold and the voice energy of the following preset number M0 of consecutive frames is above Elow, the speech frame where the voice energy value begins to increase is regarded as the speech stream start frame S0.
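A minimal sketch of this start-frame rule, assuming the frame energies have already been computed; Elow and M0 stand in for the statistically or empirically obtained values described above:

```python
def find_start_frame(energies, e_low, m0):
    """Return the index of the speech stream start frame S0, or None.

    Looks for m0 consecutive frames with energy below e_low followed
    immediately by m0 consecutive frames above it; S0 is the frame
    where the energy value starts to increase.
    """
    for i in range(m0, len(energies) - m0 + 1):
        quiet = all(e < e_low for e in energies[i - m0:i])
        loud = all(e > e_low for e in energies[i:i + m0])
        if quiet and loud:
            return i
    return None
```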
In some embodiments, the cepstral features may include one or both of the zero-dimensional cepstrum C0 of the frame to be detected and the peak position t0 of the frame to be detected, and the mute point ratio threshold may be determined from C0 alone, from t0 alone, or from C0 and t0 together.
In some embodiments, the step S1300 may include:
step S1310, determining a mute point ratio threshold according to the zero-dimensional cepstrum C0 of the frame to be detected.
In some embodiments, the mute point ratio threshold is calculated from the ratio of the second threshold adjustment parameter to the first threshold adjustment parameter, and is used to judge whether the frame to be detected is a tail point frame. It can be understood that the smaller the zero-dimensional cepstrum C0, the smaller the first threshold adjustment parameter, which in turn affects the mute point ratio threshold of the frame to be detected.
More specifically, referring to fig. 3, the step S1310 may include:
step S1311, determining a mute point ratio threshold according to the zero-dimensional cepstrum C0 and a first calculation formula;
in step S1312, the first calculation formula may include:
R=R2/R1
R1=a*C0;
where R is the mute point ratio threshold, R1 is the first threshold adjustment parameter, R2 is the second threshold adjustment parameter, and a is the first threshold adjustment constant.
In some embodiments, the first threshold adjustment constant a and the second threshold adjustment parameter R2 can be obtained through repeated statistical tests or taken as empirical values, and the value of a lies between 0 and 1.
When the mute point ratio is greater than the mute point ratio threshold determined by the first calculation formula, the frame to be detected is judged to be a tail point frame.
It can be understood that, assuming that the total number of the mute frames is n and the preset length time window is T, the calculation formula of the mute point ratio r in the preset length time window at this time is as follows:
r=n/T;
when r is greater than R, the current frame to be detected is judged to be a tail point frame; if r is less than R, the current frame to be detected is judged not to be a tail point frame.
If the frame to be detected is a tail point frame, the voice stream data between the voice stream start frame and the tail point frame is intercepted, decoded and recognized to finally obtain the response information; if the frame to be detected is not a tail point frame, detection proceeds to the frame after the current one and tail point frame detection continues.
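A sketch of this branch of the decision, with the first calculation formula inlined; the constants a and R2 below are stand-ins for the statistically obtained values mentioned above:

```python
def tail_point_by_c0(r, c0, a=0.5, r2=0.4):
    """First calculation formula: R = R2 / R1, with R1 = a * C0.

    r -- mute point ratio n/T within the preset-length time window
    Returns True when r > R, i.e. the frame to be detected is judged
    to be a tail point frame.
    """
    R = r2 / (a * c0)
    return r > R
```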
It can be understood that, the determining of the mute frame in step S1200 described above uses a mute frame determination method, which mainly includes:
step S1210, extracting the multi-threshold zero-crossing rate of a frame of audio data and summing its weighted values to obtain the total zero-crossing rate Z. Thresholds T1, T2, T3 of different heights, with T1 < T2 < T3, are set for multi-threshold zero-crossing rate detection, and for each frame the three threshold zero-crossing rates Z1, Z2, Z3 corresponding to T1, T2, T3 are calculated by the following formula:
Zi = ∑{ |sgn[x(n) - Ti] - sgn[x(n-1) - Ti]| + |sgn[x(n) + Ti] - sgn[x(n-1) + Ti]| } * w(n), i = 1, 2, 3;
The total zero-crossing rate Z is represented by the following formula:
Z=W1*Z1+W2*Z2+W3*Z3
where W1, W2, W3 are the zero-crossing rate weights, w(n) is a weighting window, and Z0 is the total zero-crossing rate cutoff value.
Step S1220, using multi-threshold zero crossing rate to weight and pre-judge the mute, if the total zero crossing rate Z of a frame of audio data is less than the set threshold Z0Judging the sound is mute;
step S1230, if the frame is not silent, extracting the composite feature of the frame of audio data;
wherein the composite features may include zero crossing rate, short-time energy value, mel-scale cepstral coefficients based on variable resolution spectrum;
step S1240, classifying the composite features of the audio with a two-class support vector machine, yielding the two classes of normal voice and silence, as sketched in part below.
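A sketch of the pre-judgement in steps S1210-S1220, under assumed thresholds, weights, and cutoff; the two-class support vector machine of steps S1230-S1240 is omitted here:

```python
import numpy as np

def total_zero_crossing_rate(x, thresholds=(0.01, 0.02, 0.04),
                             weights=(0.5, 0.3, 0.2)):
    """Weighted total zero-crossing rate Z = W1*Z1 + W2*Z2 + W3*Z3.

    For each threshold Ti, crossings of both +Ti and -Ti are counted,
    as in the Zi formula above (the weighting window is taken as 1).
    """
    z = 0.0
    for ti, wi in zip(thresholds, weights):
        zi = (np.abs(np.diff(np.sign(x - ti))).sum() +
              np.abs(np.diff(np.sign(x + ti))).sum()) / 2
        z += wi * zi
    return z

# Step S1220: pre-judge a frame as mute when Z < Z0 (illustrative cutoff).
Z0 = 5.0
frame = 0.001 * np.random.randn(400)          # a near-silent frame
print(total_zero_crossing_rate(frame) < Z0)   # True: judged mute
```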
It can be understood that only when the accuracy of mute frame identification is improved can the mute point ratio r within the preset-length time window be calculated accurately; this provides the numerical basis for the accurate voice tail point detection of this embodiment and finally makes that detection more accurate.
According to this embodiment, the mute point ratio threshold is obtained from the zero-dimensional cepstrum C0 of the cepstral features through the corresponding calculation formula and compared with the mute point ratio within the preset-length time window, so that whether the frame to be detected is a tail point frame is finally judged. Detection and recognition of the voice tail point frame thus become more accurate, and collecting real-time frames from the voice stream provides real-time, automatically adjusted detection.
In some embodiments, the cepstral features may include the peak position t0 of the frame to be detected, and the mute point ratio threshold is calculated from the ratio of the second threshold adjustment parameter to the first threshold adjustment parameter. The second threshold adjustment parameter is positively correlated with the peak position t0: the larger t0 is, the larger the second threshold adjustment parameter; conversely, the smaller t0 is, the smaller the second threshold adjustment parameter.
The step S1300 may include:
step S1320, determining the mute point ratio threshold according to the peak position t0 of the frame to be detected.
More specifically, referring to fig. 4, step S1320 may include the steps of:
step S1321, determining a mute point ratio threshold according to the peak position t0 of the frame to be detected and a second calculation formula;
in step S1322, the second calculation formula may include:
R=R2/R1
R2=b*t0;
where R is the mute point ratio threshold, R1 is the first threshold adjustment parameter, R2 is the second threshold adjustment parameter, b is the second threshold adjustment constant, and t0 is the peak position. The second threshold adjustment constant b and the parameter R1 can be obtained through statistical tests or taken as empirical values, and the value of b lies between 0 and 1.
Specifically, the peak position t0 of the cepstrum is the peak position of the pitch period of the frame to be detected and is used to adjust the second threshold adjustment parameter corresponding to t0; the later the peak position, the larger t0, so the peak position is also an important factor influencing voice tail point detection.
Assuming that the total number of the mute frames is n and the preset length time window is T, the calculation formula of the ratio r of the mute points in the preset length time window at this time is as follows:
r=n/T;
when r is greater than R, the current frame to be detected is judged to be a tail point frame; when r is less than R, the current frame to be detected is judged not to be a tail point frame.
If the frame to be detected is a tail point frame, the voice data between the voice stream start frame and the tail point frame is intercepted, decoded and recognized to finally obtain the response information; if the frame to be detected is not a tail point frame, detection proceeds to the frame after the current one and tail point frame detection continues.
According to this embodiment, the mute point ratio threshold is obtained from the peak position t0 of the cepstral features through the corresponding calculation formula and compared with the mute point ratio within the preset-length time window, so that whether the frame to be detected is a tail point frame is finally judged; this makes voice tail point frame detection more accurate while the real-time frames collected from the voice stream keep the detection real-time and automatically adjusted.
In some embodiments, the cepstral features of the frame to be detected may include both the zero-dimensional cepstrum C0 and the peak position t0 of the frame to be detected. The mute point ratio threshold is calculated from the ratio of the second threshold adjustment parameter to the first threshold adjustment parameter, where the first threshold adjustment parameter is positively correlated with the zero-dimensional cepstrum C0 and the second threshold adjustment parameter is positively correlated with the peak position t0.
The step S1300 may include:
and S1330, determining a mute point ratio threshold according to the zero-dimensional cepstrum C0 of the frame to be detected and the peak position t0 of the frame to be detected.
More specifically, referring to fig. 5, step S1330 may include the following sub-steps:
step S1331, determining a mute point ratio threshold according to the zero-dimensional cepstrum C0, the peak position t0 and a third calculation formula;
in step S1332, the third calculation formula may include:
R=R2/R1
R1=a*C0;
R2=b*t0;
where R is the mute point ratio threshold, R1 is the first threshold adjustment parameter, R2 is the second threshold adjustment parameter, and a and b are threshold adjustment constants.
Wherein, the value ranges of a and b are both 0 to 1.
It can be understood that C0 of the cepstral features of the frame to be detected affects the denominator of the third calculation formula, while the peak position t0 of the cepstral features affects the numerator. Both the numerator and the denominator are variables and are adjusted simultaneously: C0 adjusts the first threshold adjustment parameter and the peak position t0 adjusts the second. The final result therefore gains real-time performance and reliability, and the robustness of voice tail point detection can be greatly improved.
It can be understood that the slower the user speaks, the smaller the corresponding zero-dimensional cepstrum C0 and the larger the peak position t0 of the cepstral pitch period. The denominator R1 therefore becomes smaller while the numerator R2 becomes larger, so the mute point ratio threshold grows; a slower speaker thus requires a larger proportion of mute points before a tail point is declared, which finally makes voice tail point detection more accurate.
Assuming that the total number of the mute frames is n and the preset length time window is T, the calculation formula of the ratio r of the mute points in the preset length time window at this time is as follows:
r=n/T;
when r is greater than R, the current frame to be detected is judged to be a tail point frame; when r is less than R, the current frame to be detected is judged not to be a tail point frame.
If the frame to be detected is a tail point frame, the voice stream data between the voice stream start frame and the tail point frame is intercepted, decoded and recognized to finally obtain the response information; if the frame to be detected is not a tail point frame, detection proceeds to the frame after the current one and tail point frame detection continues. By adjusting both the denominator and the numerator of the ratio, this embodiment reduces the sensitivity to any single parameter and increases the robustness of the system.
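The adaptive effect of the third calculation formula can be seen numerically; the constants and the C0 and t0 values below are purely illustrative:

```python
def mute_ratio_threshold(c0, t0, a=0.5, b=0.1):
    # Third calculation formula: R = R2 / R1 = (b * t0) / (a * C0)
    return (b * t0) / (a * c0)

# Slower speech: smaller C0 and a later pitch peak t0 give a larger R,
# so more of the window must be mute before a tail point is declared.
R_fast = mute_ratio_threshold(c0=10.0, t0=30)   # fast speaker (illustrative)
R_slow = mute_ratio_threshold(c0=9.0, t0=36)    # slow speaker (illustrative)
assert R_slow > R_fast                          # 0.8 > 0.6
```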
In some embodiments, the preset length time window is a time window of 40-80 frame lengths, wherein the frame lengths are 20-30 ms.
In some embodiments, in the ideal case where the speech frames do not overlap, the time length of the preset-length time window is the product of the frame length and the number of frames in the window:

  • 25 ms frames: 40 × 25 ms = 1 s; 80 × 25 ms = 2 s
  • 20 ms frames: 40 × 20 ms = 0.8 s; 80 × 20 ms = 1.6 s
  • 30 ms frames: 40 × 30 ms = 1.2 s; 80 × 30 ms = 2.4 s
It can be understood that covering preset window durations from 0.8 s to 2.4 s accommodates more application scenarios, which conveniently broadens the applicable range of this embodiment.
It can be understood that the preset-length time window may be set from empirical or experimental values. If the window is too short, tail point detection on the voice stream of a slow speaker will misjudge non-tail-point frames as tail point frames, because the mute point ratio within the window is higher (for example, a mid-stream pause is falsely detected as the tail point), which reduces the accuracy of voice tail point frame detection. If the window is too long, the real tail point frame of a fast speaker may be missed, with many frames lying between the real tail point frame and the frame to be detected at the preset window position; speech recognition then suffers long delays, which not only occupies too many system resources but also reduces the accuracy and real-time performance of voice tail point detection and harms the user experience.
The embodiment of the application comprehensively considers the factors, and sets the preset length time window to be 40-80 frames according to the general situation of the voice stream generated by the user speaking, thereby effectively improving the accuracy and the real-time performance of voice tail point detection.
In some embodiments, step S1400 may include:
step S1410: when the mute point proportion is larger than the mute point proportion threshold value, determining the frame to be detected as a tail point frame;
step S1420: otherwise, continuing to acquire the next speech frame after the frame to be detected as the new frame to be detected, until the mute point ratio is greater than the mute point ratio threshold and the frame to be detected is determined to be the tail point frame.
It can be understood that the purpose of the embodiments of the present application is to finally determine the voice tail point. The method for determining that the frame to be detected is a tail point frame in step S1410 has been described in the foregoing embodiments and is not repeated here; likewise, the start frame in this embodiment is the start frame S0 of the above embodiments, whose determination is also not repeated. In step S1420, the speech frame S1 that follows the start frame S0 by a preset-length time window T is examined first: if its mute point ratio is less than the mute point ratio threshold, S1 is judged to be a non-tail-point frame and the same tail point frame detection method is applied to the next frame to be detected, S2. If S2 is also judged to be a non-tail-point frame, steps S1100-S1400 continue with the frame after S2, and so on: after the frame to be detected S1, the successive detection points S2, S3, S4, ... correspond to the frames T+1, T+2, T+3, ... after the start frame S0, until the mute point ratio is greater than the mute point ratio threshold and the detected speech frame SN (N = 2, 3, ...) is the tail point frame, at which point detection ends. A sketch of this loop follows.
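Putting the pieces together, the frame-by-frame scan of step S1420 can be written as a single loop. The per-frame dictionary fields (is_mute, c0, t0) and the constants are assumptions of this sketch, not data structures defined by the application:

```python
def detect_tail_point(frames, start, T=70, a=0.5, b=0.1):
    """Scan the frames after the start frame S0 for the tail point frame.

    frames -- list of dicts with 'is_mute', 'c0' and 't0' per speech frame
    Returns the index of the tail point frame SN, or None if the
    stream ends before one is found.
    """
    for i in range(start + T, len(frames)):        # S1, S2, S3, ...
        window = frames[i - T:i]                   # preset-length time window
        r = sum(f["is_mute"] for f in window) / T  # mute point ratio
        R = (b * frames[i]["t0"]) / (a * frames[i]["c0"])
        if r > R:
            return i                               # tail point frame found
    return None                                    # keep waiting for audio
```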
According to the embodiments of the present application, the mute point ratio within the preset-length time window, together with the peak position t0 and the zero-dimensional cepstrum C0 of the cepstral features of the frame to be detected, dynamically adjusts the mute point ratio threshold, and whether the frame to be detected is a tail point frame is determined from the mute point ratio and this threshold. This overcomes the drawbacks that related voice signal processing methods are easily affected by noise and that the average speech rate is hard to obtain, greatly improves the robustness of voice tail point detection in speech recognition, guarantees the accuracy and real-time performance of the detection, and improves the user experience.
In some embodiments, referring to fig. 6, the speech processing method may further include:
step S1900, acquiring voice stream data before a tail point frame;
step S2000, recognizing the voice stream data and outputting the response information.
The purpose of speech recognition is to allow a machine to understand what a person says, understand the person's intent, and react accordingly.
In step S1900, the voice stream data before the tail point frame is acquired.
In some embodiments, after the tail point frame of the speech is determined, the voice stream between the start frame and the tail point frame needs to be obtained in order to get the voice stream segment from the beginning of the utterance to its end. For example, assume the start frame of the voice stream is S0 and the tail point frame is determined to be S1; the voice stream between the start frame S0 and the tail point frame S1 is then intercepted. If S1 is judged to be a non-tail-point frame and the frame after S1, S2, is a tail point frame, the voice stream between the start frame S0 and the tail point frame S2 is intercepted instead; by analogy, the voice stream data between the voice stream start frame and the voice stream tail point frame is acquired.
Step S2000, recognizing the voice stream data and outputting the response information.
In some embodiments, the response information may be text, audio, video, or another type of response. For example, when the voice tail point detection and speech recognition methods of this embodiment are applied to voice wake-up, the voice stream between the voice stream start frame and the tail point frame is intercepted once the tail point frame has been determined by the voice signal processing method, and the voice instruction corresponding to the voice stream recognized by the speech recognition method is then executed; for instance, if the user wants to wake up the terminal device interface by voice, saying "please light up the screen" to the terminal device can automatically light up its interface. As another example, when the methods are applied to speech translation with feedback in audio form, the voice stream between the start frame and the tail point frame is intercepted after the tail point frame is determined, the stream is recognized, and the corresponding audio information is played on the translation interface. As yet another example, the methods can be applied to the driving assistant of a vehicle-mounted device: the driving assistant is an application installed on the vehicle-mounted device that can locate the vehicle in real time, start route navigation, report nearby road conditions in real time, and intelligently control in-car music and calls, helping the user operate it by voice. When the user needs the driving assistant, voice information such as "please go to a certain place" is input through the microphone of the vehicle-mounted device; the driving assistant receives the voice signal, performs voice tail point detection and speech recognition, and gives response information so the driver can notice whether the correct voice signal was received. The driving assistant often responds in audio or video form, for example playing the audio "the route to the destination has been planned, please confirm", or displaying the destination route on the display screen of the vehicle-mounted device after the recognized speech is translated. The application is not limited to these scenarios; it can also be applied to other voice-input scenarios such as voice assistants and smart speakers.
It can be understood that the speech recognition is to decode the cepstrum feature sequence corresponding to the intercepted effective frame sequence into the response information of the speech stream by using the acoustic and language models.
The speech recognition process is mainly divided into three parts:
the first part, extracting characteristic parameters, preprocessing the voice signal, extracting voice characteristic parameters to represent the voice signal;
a second part, training a model, and training an acoustic model and a language model by using the characteristic parameters;
and the third part is pattern matching, namely matching the characteristic parameters of the voice signal to be recognized with the trained model to generate a recognition effect.
The acoustic model is the bottom model of the recognition system, and is the most critical part of the speech recognition system. The objective of the acoustic model is to calculate the distance between the sequence of speech feature vectors and each pronunciation template. The acoustic model is designed to find the smallest recognition unit, which is closely related to the pronunciation characteristics of the language. The size of the recognition unit has a large influence on the size of the voice data amount, the recognition rate, and the flexibility. Wherein the recognition unit may be a word, a demi-syllable, or a phoneme.
The language model refers to some rules or grammatical structures in the language, and may also be a statistical model representing word or word context. Due to the complexity of the voice signals, the phenomenon of overlapping connection exists between different pronunciations, and even people can not distinguish the pronunciations if the front and the back of the single tone are not connected, so that the distinguishing degree of the acoustic model can be improved by means of the language model. The embodiment utilizes a relatively mature model, namely a statistical language model, which extracts the statistical relationship between different characters and words through statistics of a large number of text files.
It is understood that acquiring the voice stream data before the tail point frame means acquiring the voice stream from the start frame to the tail point frame. For example, mark the start frame of the voice stream as S0 and the preset-length time window as T; the speech frame S1, delayed T frames after S0, is taken as the frame to be detected. If the method of the first aspect detects that this speech frame is a tail point frame, the voice stream intercepted in this embodiment runs from S0 to S1; if it is detected not to be a tail point frame, the next candidate S2 is checked, and if S2 is a tail point frame the intercepted voice stream runs from S0 to S2; if not, the next candidate S3 is checked, and detection keeps advancing one frame at a time until the frame to be detected is a tail point frame, whereupon the voice stream between the start frame and the tail point frame is intercepted and passed to speech recognition processing.
This embodiment intercepts the voice stream between the start frame and the tail point frame once the tail point frame has been determined, recognizes the voice stream in real time, and outputs the corresponding response information.
In a second aspect, an embodiment of the present application provides an electronic device.
In some embodiments, referring to fig. 7, the electronic device may include one or more processors 110; a storage device 120 for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform: the speech signal processing method according to the first aspect.
In some embodiments, the electronic device may be a mobile terminal device or a non-mobile terminal device. The mobile terminal equipment can be a mobile phone, a tablet computer, a notebook computer, a palm computer, vehicle-mounted terminal equipment, wearable equipment, a super mobile personal computer, a netbook, a personal digital assistant and the like; the non-mobile terminal equipment can be a personal computer, a television, a teller machine or a self-service machine and the like; the embodiments of the present invention are not particularly limited.
For example, when the electronic device in this embodiment is a mobile terminal device, the user uses a microphone or similar device on the mobile terminal to acquire the speaker's voice signal; tail point detection and speech recognition are performed with the speech processing method of the first aspect, and the mobile terminal device provides response information, which may be in text, audio, or video form. For example, the voice signal processing method of the first aspect can perform speech recognition in the driving assistant of a vehicle-mounted device: the driving assistant is an application installed on the vehicle-mounted device that can locate the vehicle in real time, start route navigation, report nearby road conditions in real time, and intelligently control in-car music and calls, helping the user operate it by voice. When the user needs the driving assistant, voice information such as "please go to a certain place" is input through the microphone on the vehicle-mounted device; the driving assistant obtains the voice signal, performs voice tail point detection and speech recognition, and gives response information. To help the driver notice whether the correct voice signal was received, the driving assistant often responds in audio or video form, for example playing the audio "the route to the destination has been planned, please confirm", or displaying the destination route on the display screen of the vehicle-mounted device after the recognized speech is translated. The application is not limited to these scenarios and can also be applied to other voice-input scenarios such as voice assistants and smart speakers. It should be noted that the above application scenarios impose no requirement on the network and can be implemented offline or over a wired or wireless network.
In some embodiments, the electronic device performs the speech signal processing method of steps S1100 to S2000 as in the first aspect embodiment described above.
According to the embodiments of the present application, the mute point ratio within the preset-length time window, together with the peak position t0 and the zero-dimensional cepstrum C0 of the cepstral features of the frame to be detected, dynamically adjusts the mute point ratio threshold, and whether the frame to be detected is a tail point frame is determined from the mute point ratio and this threshold. This overcomes the drawbacks that related voice signal processing methods are easily affected by noise and that the average speech rate is hard to obtain, greatly improves the robustness of voice tail point detection in speech recognition, guarantees the accuracy and real-time performance of the detection, and improves the user experience.
In a third aspect, embodiments of the present application provide a computer-readable storage medium.
In some embodiments, the computer-readable storage medium stores computer-executable instructions for performing: a speech signal processing method as in the first aspect.
In some embodiments, a computer-readable storage medium stores computer-executable instructions for performing the speech signal processing method in steps S1100 to S2000 as in the first aspect embodiment described above.
According to the embodiments of the present application, the mute point ratio within the preset-length time window, together with the peak position t0 and the zero-dimensional cepstrum C0 of the cepstral features of the frame to be detected, dynamically adjusts the mute point ratio threshold, and whether the frame to be detected is a tail point frame is determined from the mute point ratio and this threshold. This overcomes the drawbacks that related voice signal processing methods are easily affected by noise and that the average speech rate is hard to obtain, greatly improves the robustness of voice tail point detection in speech recognition, guarantees the accuracy and real-time performance of the detection, and improves the user experience.
In a fourth aspect, an embodiment of the present application provides a speech processing apparatus.
In some embodiments, referring to fig. 8, the speech processing apparatus may include:
the audio extraction module 210 is configured to obtain audio features of a frame to be detected in a speech signal;
the audio processing module 220 is connected to the audio extraction module 210 and is configured to obtain the mute point ratio within a time window of preset length before the frame to be detected, determine the mute point ratio threshold according to the audio features of the frame to be detected, and determine whether the frame to be detected is a tail point frame according to the mute point ratio and the mute point ratio threshold.
In some embodiments, the audio processing module 220 extracts cepstral features from the audio features obtained by the audio extraction module 210; the pitch period peak position t0 and the zero-dimensional cepstrum C0 of each speech frame in the extracted cepstral features are the key parameters for tail point detection. When tail point detection is performed, each frame to be detected is examined against a time window of preset length extending backward from it; that is, the mute point ratio within the time window of preset length before the frame to be detected is obtained. The mute point ratio threshold is then dynamically adjusted according to the pitch period peak position t0 and the zero-dimensional cepstrum C0 in the cepstral features of the current frame to be detected, and finally whether the frame to be detected is the tail point frame is judged from the mute point ratio and the mute point ratio threshold, as sketched in the code below.
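The following Python code is a minimal sketch of one possible reading of this detection step. The patent specifies only that the first threshold adjustment parameter is positively correlated with C0 and the second with t0 (see claims 4 to 6), not their exact form, so the linear adjustment parameters, the base threshold and clamp, the energy-based mute point test, and the pitch lag search range assumed here are all illustrative, not values from the patent.

    import numpy as np

    def cepstrum(frame: np.ndarray) -> np.ndarray:
        # Real cepstrum of one speech frame: IFFT of the log magnitude spectrum.
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-12   # small offset avoids log(0)
        return np.fft.irfft(np.log(spectrum))

    def is_silence(frame: np.ndarray, energy_floor: float = 1e-4) -> bool:
        # Illustrative mute point test: mean frame energy below an assumed floor.
        return float(np.mean(frame ** 2)) < energy_floor

    def dynamic_threshold(c0: float, t0: int, base: float = 0.8,
                          alpha: float = 0.05, beta: float = 0.002) -> float:
        # Ratio of two adjustment parameters, as in claims 4-6: the first is
        # positively correlated with C0, the second with t0. The linear forms
        # and all constants here are assumptions for illustration only.
        first = 1.0 + alpha * c0
        second = 1.0 + beta * t0
        return min(max(base * first / second, 0.5), 0.95)   # clamp to a sane range

    def is_tail_frame(mute_history: list, frame: np.ndarray,
                      window: int = 60) -> bool:
        # mute_history holds booleans (mute or not) for the preceding frames;
        # window = 60 frames sits inside the 40-80 frame range of claim 7.
        recent = mute_history[-window:]
        ratio = sum(recent) / max(len(recent), 1)
        ceps = cepstrum(frame)
        c0 = float(ceps[0])                        # zero-dimensional cepstrum C0
        t0 = int(np.argmax(ceps[32:320])) + 32     # pitch period peak position
        return ratio > dynamic_threshold(c0, t0)

With 25 ms frames at 16 kHz (400 samples), the assumed lag range of 32 to 320 samples corresponds roughly to pitch frequencies from 500 Hz down to 50 Hz.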
In some embodiments, the speech processing apparatus may further include:
the voice signal capture module, connected to the audio extraction module, is configured to receive an analog voice signal, convert it into a digital voice signal, and transmit the digital voice signal to the audio extraction module.
Specifically, the voice signal capture module is the component through which the electronic device converts an analog voice signal into a digital voice signal, for example via a microphone. The sampling rate is typically 16 kHz and the sample depth 8 bits. An audio segment of 25 ms that overlaps its neighbors by 10 ms is commonly used as a voice frame, the minimum unit for feature extraction, and each voice frame may comprise a number of sampling points; a framing sketch follows.
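As a minimal framing sketch, the 10 ms figure in the passage is read here with the common convention that it denotes the frame shift, so consecutive 25 ms frames overlap by 15 ms; that reading is an assumption, as are the helper name and defaults.

    import numpy as np

    def frame_signal(samples: np.ndarray, sample_rate: int = 16000,
                     frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
        # Split a digital speech signal into overlapping voice frames.
        frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
        hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
        n_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
        return np.stack([samples[i * hop_len : i * hop_len + frame_len]
                         for i in range(n_frames)])

For instance, one second of audio at 16 kHz (frame_signal(np.zeros(16000))) yields 98 frames of 400 samples each.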
In some embodiments, referring to fig. 9, the speech processing apparatus may further include:
the obtaining module 230 is connected to the audio processing module 220, and is configured to obtain voice stream data before the end point frame.
In some embodiments, when the audio processing module 220 detects that the mute point ratio of the frame to be detected exceeds the mute point ratio threshold, the frame to be detected is determined to be the tail point frame of the voice stream; at this point the obtaining module 230 intercepts the voice stream data between the start frame and the tail point frame of the voice stream, as sketched below.
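A minimal sketch of this interception step, under the same assumptions as the earlier code: it reuses the illustrative is_silence and is_tail_frame helpers, and start_idx (the detected start frame) is assumed to come from a separate head point detector that this embodiment does not describe.

    def intercept_utterance(frames, start_idx: int):
        # Buffer frames from the detected start frame onward and cut the
        # voice stream once a frame is judged to be the tail point frame,
        # mirroring the hand-off from the audio processing module 220 to
        # the obtaining module 230.
        buffered, mute_history = [], []
        for idx, frame in enumerate(frames):
            if idx < start_idx:
                continue
            buffered.append(frame)
            mute_history.append(is_silence(frame))
            if is_tail_frame(mute_history, frame):
                return buffered      # start frame through tail point frame
        return buffered              # stream ended before a tail was found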
The recognition module 240 is connected to the obtaining module 230 and is configured to recognize the voice stream data transmitted by the obtaining module 230 and output response information.
In some embodiments, the recognition module 240 may include an acoustic model and a language model, and may decode the acquired voice stream data into corresponding response information.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to one of ordinary skill in the art.
While the preferred embodiments of the present invention have been described, the present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions are included in the scope of the present invention defined by the claims.

Claims (11)

1. A speech signal processing method, comprising:
acquiring audio features of a frame to be detected in a voice signal;
obtaining a mute point ratio within a time window of preset length before the frame to be detected;
determining a mute point ratio threshold according to the audio features;
and determining a tail point frame in the voice signal according to the mute point ratio and the mute point ratio threshold.
2. The method of claim 1, wherein the audio features are cepstral features.
3. The method of claim 2, wherein the cepstral features comprise one or more of:
the zero-dimensional cepstrum C0 of the frame to be detected and the peak position t0 of the frame to be detected.
4. The method according to claim 3, wherein, when the cepstral features comprise the zero-dimensional cepstrum C0 of the frame to be detected,
the determining a mute point ratio threshold according to the audio features comprises:
calculating the mute point ratio threshold according to the ratio of a first threshold adjustment parameter to a second threshold adjustment parameter, wherein the first threshold adjustment parameter is positively correlated with the zero-dimensional cepstrum C0.
5. The method according to claim 3, wherein, when the cepstral features comprise the peak position t0 of the frame to be detected,
the determining a mute point ratio threshold according to the audio features comprises:
calculating the mute point ratio threshold according to the ratio of a first threshold adjustment parameter to a second threshold adjustment parameter,
wherein the second threshold adjustment parameter is positively correlated with the peak position t0.
6. The method according to claim 3, wherein, when the cepstral features comprise the zero-dimensional cepstrum C0 of the frame to be detected and the peak position t0 of the frame to be detected,
the determining a mute point ratio threshold according to the audio features comprises:
calculating the mute point ratio threshold according to the ratio of a first threshold adjustment parameter to a second threshold adjustment parameter, wherein the first threshold adjustment parameter is positively correlated with the zero-dimensional cepstrum C0 and the second threshold adjustment parameter is positively correlated with the peak position t0.
7. The method according to claim 1, wherein the time window of preset length is a time window of 40 to 80 frame lengths, each frame length being 20 to 30 ms.
8. The method according to any one of claims 1 to 7, wherein determining the tail point frame in the speech signal according to the mute point ratio and the mute point ratio threshold comprises:
when the mute point ratio is greater than the mute point ratio threshold, determining that the frame to be detected is the tail point frame;
otherwise, continuing to acquire the next voice frame after the frame to be detected as the frame to be detected and repeating the detection, until the mute point ratio is greater than the mute point ratio threshold and the frame to be detected is determined to be the tail point frame.
9. The method of any one of claims 1 to 7, further comprising:
acquiring voice stream data before the tail point frame;
and recognizing the voice stream data and outputting response information.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform:
the speech signal processing method according to any one of claims 1 to 7.
11. A computer-readable storage medium storing computer-executable instructions for performing:
the speech signal processing method according to any one of claims 1 to 7.
CN202010581908.1A 2020-06-23 2020-06-23 Voice signal processing method, apparatus and storage medium Pending CN111768800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010581908.1A CN111768800A (en) 2020-06-23 2020-06-23 Voice signal processing method, apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010581908.1A CN111768800A (en) 2020-06-23 2020-06-23 Voice signal processing method, apparatus and storage medium

Publications (1)

Publication Number Publication Date
CN111768800A true CN111768800A (en) 2020-10-13

Family

ID=72722110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010581908.1A Pending CN111768800A (en) 2020-06-23 2020-06-23 Voice signal processing method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN111768800A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000276200A (en) * 1999-03-26 2000-10-06 Matsushita Electric Works Ltd Voice quality converting system
CN1758331A (en) * 2005-10-31 2006-04-12 浙江大学 Quick audio-frequency separating method based on tonic frequency
JP2008176155A (en) * 2007-01-19 2008-07-31 Kddi Corp Voice recognition device and its utterance determination method, and utterance determination program and its storage medium
US20090177466A1 (en) * 2007-12-20 2009-07-09 Kabushiki Kaisha Toshiba Detection of speech spectral peaks and speech recognition method and system
US20090222258A1 (en) * 2008-02-29 2009-09-03 Takashi Fukuda Voice activity detection system, method, and program product
US20130197911A1 (en) * 2010-10-29 2013-08-01 Anhui Ustc Iflytek Co., Ltd. Method and System For Endpoint Automatic Detection of Audio Record
US20130322644A1 (en) * 2012-05-31 2013-12-05 Yamaha Corporation Sound Processing Apparatus
WO2017012242A1 (en) * 2015-07-22 2017-01-26 百度在线网络技术(北京)有限公司 Voice recognition method and apparatus
US20180350388A1 (en) * 2017-05-31 2018-12-06 International Business Machines Corporation Fast playback in media files with reduced impact to speech quality
CN108305639A (en) * 2018-05-11 2018-07-20 南京邮电大学 Speech-emotion recognition method, computer readable storage medium, terminal
CN110349598A (en) * 2019-07-15 2019-10-18 桂林电子科技大学 A kind of end-point detecting method under low signal-to-noise ratio environment
CN111105782A (en) * 2019-11-27 2020-05-05 深圳追一科技有限公司 Session interaction processing method and device, computer equipment and storage medium
CN114155839A (en) * 2021-12-15 2022-03-08 科大讯飞股份有限公司 Voice endpoint detection method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU SIWEI ET AL.: "Research on an adaptive real-time voice activity detection method based on G.729", Computer Engineering and Applications, 31 December 2007 (2007-12-31), pages 57-60 *
DU YUXUAN: "Research on speaker segmentation and clustering in multi-speaker scenarios", Wanfang master's thesis, 16 January 2024 (2024-01-16), page 12 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination