CN111768800A - Voice signal processing method, apparatus and storage medium

Voice signal processing method, apparatus and storage medium

Info

Publication number
CN111768800A
Authority
CN
China
Prior art keywords: frame, detected, mute, voice, threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010581908.1A
Other languages
Chinese (zh)
Inventor
曹刚 (Cao Gang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN202010581908.1A
Publication of CN111768800A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/05: Word boundary detection
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L 25/78: Detection of presence or absence of voice signals
    • G10L 2025/783: Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephone Function (AREA)

Abstract

Embodiments of the present application relate to a voice signal processing method, device, and storage medium. The method includes the following steps: acquiring the audio features of a frame to be detected; obtaining the mute point ratio within a preset-length time window before the frame to be detected in the voice signal; determining a mute point ratio threshold according to the audio features; and judging whether the frame to be detected is a tail point frame according to the mute point ratio and the mute point ratio threshold. By using the mute point ratio within the preset-length time window and dynamically adjusting the mute point ratio threshold with the cepstral features of the current frame to be detected, the embodiments overcome the inaccurate voice tail point detection caused by a fixed mute point ratio threshold and effectively improve the accuracy and real-time performance of tail point frame detection.

Description

Voice signal processing method, apparatus and storage medium
Technical Field
The embodiments of the present application relate to, but are not limited to, the field of computer technologies, and in particular, to a method, a device, and a storage medium for processing a voice signal.
Background
With the development of artificial intelligence, speech recognition has become a standard feature of many devices. Speech recognition takes speech as its research object: through speech signal processing and pattern recognition, a machine automatically recognizes and understands human spoken language.
Voice tail point detection plays a key role in speech recognition: it finds the tail point of the speech within the audio data, and its accuracy is crucial to the accuracy of speech recognition.
At present, voice tail point detection suffers from the difficulty of determining the voice tail point, which greatly reduces the accuracy of speech recognition.
Disclosure of Invention
The embodiment of the application provides a voice signal processing method, a device and a storage medium, which can improve the accuracy of voice tail point detection and recognition.
In a first aspect, an embodiment of the present application provides a speech signal processing method, including: acquiring audio features of a frame to be detected in a voice signal; obtaining the ratio of mute points in a time window with a preset length before a frame to be detected; obtaining a mute point ratio threshold according to the audio features; and determining a tail point frame in the voice signal according to the mute point ratio and the mute point ratio threshold.
In a second aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speech signal processing method of the first aspect.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for executing the speech signal processing method described in the first aspect.
The embodiments of the application include the following steps: acquiring the audio features of a frame to be detected in a voice signal; obtaining the mute point ratio within a preset-length time window before the frame to be detected; determining a mute point ratio threshold according to the audio features; and determining that the frame to be detected is a tail point frame according to the mute point ratio and the mute point ratio threshold. By using the mute point ratio within the preset-length time window and dynamically adjusting the mute point ratio threshold with the cepstral features of the current frame to be detected, the embodiments overcome the inaccurate voice tail point detection caused by a fixed mute point ratio threshold and effectively improve the accuracy and real-time performance of tail point frame detection.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1 is a flowchart of a speech signal processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a speech signal processing method according to another embodiment of the present application;
FIG. 3 is a flow chart of a method for processing a speech signal according to another embodiment of the present application;
FIG. 4 is a flow chart of a speech signal processing method according to another embodiment of the present application;
FIG. 5 is a flow chart of a method for processing a speech signal according to another embodiment of the present application;
FIG. 6 is a flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic diagram of a speech processing apparatus according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Voice tail point detection is an important link in speech recognition and one of the most important steps in voice signal processing; its accuracy directly influences the speed and the result of voice signal processing.
Voice signal processing methods in the related art often suffer from insufficiently accurate voice tail point detection and an inability to adjust the detection adaptively. For example, in the related art, voice tail point detection is easily affected by noise and therefore not accurate enough; moreover, because each person speaks at a different speed, the related-art detection method is prone to misidentifying the tail point frame or recognizing it too slowly.
Based on this, the embodiments of the present application provide a speech signal processing method, device, and storage medium, which use the mute point ratio within a preset-length time window and dynamically adjust the mute point ratio threshold with the cepstral features of the current frame to be detected, thereby overcoming the inaccurate voice tail point detection caused by a fixed mute point ratio threshold and effectively improving the accuracy and real-time performance of tail point frame detection.
It should be noted that, in the following embodiments, the terminal/device may be a mobile terminal device or a non-mobile terminal device. The mobile terminal equipment can be a mobile phone, a tablet computer, a notebook computer, a palm computer, vehicle-mounted terminal equipment, wearable equipment, a super mobile personal computer, a netbook, a personal digital assistant and the like; the non-mobile terminal equipment can be a personal computer, a television, a teller machine or a self-service machine and the like; the embodiments of the present application are not particularly limited.
In some embodiments, the electronic device may include a processor, an external memory interface, an internal memory, a Universal Serial Bus (USB) interface, a charging management module, a power management module, a battery, an antenna, a mobile communication module, a wireless communication module, an audio module, a speaker, a microphone, an earphone interface, a sensor module, a button, a motor, an indicator, a camera, a display screen, a Subscriber Identification Module (SIM) card interface, and the like. The sensor module may include a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
In some embodiments, the electronic device may implement audio functions via an audio module, a speaker, a microphone, a headset interface, an application processor, and/or the like. Such as music playing, recording, etc.
In some embodiments, the audio module is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module may also be used to encode and decode audio signals.
In some embodiments, the audio module may be disposed in the processor, or a portion of its functional modules may be disposed in the processor. A speaker, also known as a "horn", is used to convert an audio electrical signal into a sound signal; the electronic device can play music or take a hands-free call through the speaker. A receiver, also called a "handset", is used to convert an audio electrical signal into a voice signal; when the electronic device answers a call or plays voice information, the voice can be heard by placing the receiver close to the ear. A microphone, also known as a "mike", is used to convert a voice signal into an electrical signal; when making a call or sending voice information, the user can input a voice signal by speaking close to the microphone. The electronic device may be provided with at least one microphone.
In some embodiments, the electronic device may be provided with two microphones to achieve a noise reduction function in addition to collecting voice signals.
In some embodiments, the electronic device may further include three, four, or more microphones to collect a voice signal and reduce noise, and may further identify a sound source and implement a directional recording function.
In a first aspect, an embodiment of the present application provides a speech signal processing method for an electronic device.
In some embodiments, referring to fig. 1, a voice signal processing method may include:
step S1100, acquiring audio characteristics of a frame to be detected in a voice signal;
step S1200, obtaining the ratio of mute points in a time window with a preset length before a frame to be detected;
step S1300, obtaining a mute point ratio threshold according to the audio characteristics;
and step S1400, determining a tail point frame in the voice signal according to the mute point ratio and the mute point ratio threshold.
It should be noted that S1100, S1200, S1300, and S1400 only represent reference numerals, and should not be construed as limiting the order of steps. In particular, the sequence of the steps S1100 and S1200 is not limited.
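To make the relationship between these steps concrete, the following is a minimal Python sketch of one per-frame check. It assumes the combined threshold formula R = (b * t0) / (a * C0) developed later in the description; the constants a and b and all variable names are illustrative, not part of the claims.

```python
def is_tail_point(c0, t0, mute_flags, T=70, a=0.5, b=0.1):
    """One per-frame tail point check (steps S1100-S1400).

    c0, t0     -- zero-dimensional cepstrum and pitch-period peak position
                  of the frame to be detected (step S1100)
    mute_flags -- 1/0 mute decisions for the T frames before it (step S1200)
    """
    r = sum(mute_flags[-T:]) / T      # mute point ratio within the window
    R = (b * t0) / (a * c0)           # step S1300: dynamic threshold
    return r > R                      # step S1400: tail point decision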
In some embodiments, the frame to be detected in step S1100 is a speech frame awaiting detection; voice tail point detection determines whether this speech frame is a tail point frame.
In some embodiments, the audio features may be time domain features or frequency domain features, both of which can be extracted from a short segment of audio. Time domain features are extracted directly from the original speech signal, while frequency domain features are obtained by applying a Fourier transform to the original speech signal, converting it into the frequency domain, and extracting features there. In the time domain, the ordinate of each sampling point represents the energy amplitude of that point; in the frequency domain, the ordinate of each point represents the energy level of its corresponding band within a short time frame.
In some embodiments, the audio features are cepstral features, which belong to the frequency domain. It can be understood that a speech signal is inherently non-stationary, i.e. it changes dramatically over short intervals such as 10 ms, yet it is stable over the short term, so the sampling points are usually processed per short segment, or audio frame. An audio signal is composed of different energies carried at different frequencies, and more of the information in the speech signal can be obtained from its spectrum. The speech signal can be divided into a number of speech frames; each frame corresponds to a spectrum, calculated by a short-time fast Fourier transform, that represents the relationship between the frequency and the energy of the speech signal. The peaks in the spectrum represent the main frequency components of the speech and are also called formants; a formant carries a speech identification attribute and can be used to distinguish different voices. The cepstral analysis applied to the speech frame to be detected is described below.
In some embodiments, the generation of the cepstrum is a homomorphic signal process whose aim is to transform a nonlinear problem into a linear one. The original speech signal is in effect a convolutional signal. In the first step it is transformed into a multiplicative signal, since a convolution in the time domain is equivalent to a multiplication in the frequency domain; in the second step the multiplicative signal is transformed into an additive signal by taking the logarithm; in the third step an inverse Fourier transform restores it to a convolution-like signal. Although the sequences before and after are both time-domain sequences, their discrete time domains differ, so the domain of the output signal is called the cepstral domain.
Wherein the cepstral function can be expressed as:
C(q) = |IF(log(S(f)))|^2

where q is the quefrency (the independent variable of the cepstrum), f is the frequency, S(f) is the Fourier transform of the time-domain signal s(t), log() is the logarithm, and IF denotes the inverse Fourier transform.
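As an illustration only, the cepstral function above can be computed with a few lines of numpy. The Hamming window, the small epsilon guarding log(0), and the pitch search range below are assumptions of this sketch, not values given by the application.

```python
import numpy as np

def cepstrum(frame):
    """C(q) = |IF(log S(f))|^2 for one speech frame."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_spectrum = np.log(np.abs(spectrum) + 1e-12)   # guard against log(0)
    return np.abs(np.fft.irfft(log_spectrum)) ** 2

# Example: a 25 ms frame at 16 kHz (400 samples) of a 200 Hz tone.
fs = 16000
frame = np.sin(2 * np.pi * 200 * np.arange(400) / fs)
c = cepstrum(frame)
c0 = c[0]                         # zero-dimensional cepstrum C0
t0 = 32 + np.argmax(c[32:200])    # pitch-period peak position t0,
                                  # searched over roughly 80-500 Hz
```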
It can be understood that, on one hand, the cepstrum can effectively remove noise interference, and its separating property makes periodic signals easier to detect. During the logarithmic conversion of the power spectrum, the cepstrum gives a higher weight to low-amplitude components and a smaller weight to high-amplitude components; this weighting helps highlight small periodic signals. The speech volume near the voice tail point is generally low, so the speech amplitude is correspondingly small; the amplitude weighting therefore effectively strengthens the periodic signal of the tail point frame, and the cepstral features of the speech frame both remove noise effectively and stand out more clearly.
On the other hand, the cepstral features can depict the user's instantaneous speech rate in the frame to be detected. Two important parameters are the zero-dimensional cepstrum C0 (the C0 value of the cepstral features) and the peak position t0 of the pitch period in the cepstrum. The cepstral peak position t0 of slow speech lies behind that of fast speech, so the pitch period of slow speech is larger than that of fast speech; conversely, the peak position t0 of fast speech lies ahead of that of slow speech, so the fast pitch period is smaller. Meanwhile, the cepstrum has an obvious peak at the position corresponding to the pitch period, which makes the peak feature more prominent when applied to tail point frame detection; finally, combined with the zero-dimensional cepstrum C0 of the cepstral features, an unambiguous mute frame threshold can be obtained.
In some embodiments, the preset-length time window in step S1200 is a fixed total number of frames immediately preceding the frame to be detected; in this embodiment T denotes the preset-length time window. For example, if T is 70, the window holds a total of 70 speech frames, each of which may be a mute frame or a non-mute frame; assuming the number of mute frames within the window is n, the mute point ratio is n/T.
It will be appreciated that the value of T is fixed. For example, assume the speech stream start frame is S0 and the frame to be detected is the speech frame S1 that follows the preset-length time window T. When the frame to be detected moves back by one speech frame, the total number of speech frames in the window T does not change, and the starting point of the window T correspondingly moves back by one speech frame.
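Such a sliding window of mute flags behaves exactly like a fixed-size queue; a minimal sketch, with T = 70 taken from the example above:

```python
from collections import deque

T = 70                           # preset-length time window, in frames
window = deque(maxlen=T)         # mute flags of the last T speech frames

def mute_point_ratio(frame_is_mute):
    """Push the newest frame's mute flag and return r = n / T.

    Appending to a full deque drops the oldest flag, which is the
    window start moving back by one speech frame.
    """
    window.append(1 if frame_is_mute else 0)
    return sum(window) / T if len(window) == T else 0.0
```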
In some embodiments, the method for processing a speech signal may further include a method for determining a speech stream start frame using an energy value of speech, and referring to fig. 2, before S1100, the method may further include:
step S1500, setting a voice energy threshold Elow;
step S1600, acquiring M0 consecutive voice frames whose energy is lower than the voice energy threshold Elow;
step S1700, acquiring a preset number M0 of consecutive voice frames whose voice energy is higher than the voice energy threshold Elow;
step S1800, acquiring the speech frame where the speech energy value starts to increase, and regarding that frame as the speech stream start frame S0.
It can be understood that energy is the carrier of a signal and can be used to judge the presence of a speech signal. A completely noise-free environment hardly exists, however, so the criterion for the beginning of speech is not an energy threshold of 0; instead a voice energy threshold Elow is set, obtained through statistical experiments or as an empirical value. When the energy of M0 consecutive speech frames is below this threshold and the voice energy of the following preset number M0 of consecutive frames is above Elow, the speech frame where the voice energy value begins to increase is regarded as the speech stream start frame S0.
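A minimal sketch of this start-frame rule, assuming the frame energies have already been computed; Elow and M0 stand in for the statistically or empirically obtained values described above:

```python
def find_start_frame(energies, e_low, m0):
    """Return the index of the speech stream start frame S0, or None.

    Looks for m0 consecutive frames with energy below e_low followed
    immediately by m0 consecutive frames above it; S0 is the frame
    where the energy value starts to increase.
    """
    for i in range(m0, len(energies) - m0 + 1):
        quiet = all(e < e_low for e in energies[i - m0:i])
        loud = all(e > e_low for e in energies[i:i + m0])
        if quiet and loud:
            return i
    return None
```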
In some embodiments, the cepstral features may include one or both of the zero-dimensional cepstrum C0 of the frame to be detected and the peak position t0 of the frame to be detected, and the mute point ratio threshold may be determined from C0 alone, from t0 alone, or from C0 and t0 together.
In some embodiments, the step S1300 may include:
step S1310, determining a mute point ratio threshold according to the zero-dimensional cepstrum C0 of the frame to be detected.
In some embodiments, the mute point ratio threshold is calculated from the ratio of the second threshold adjustment parameter to the first threshold adjustment parameter, and is used to judge whether the frame to be detected is a tail point frame. It can be understood that the smaller the zero-dimensional cepstrum C0, the smaller the first threshold adjustment parameter, which in turn affects the mute point ratio threshold of the frame to be detected.
More specifically, referring to fig. 3, the step S1310 may include:
step S1311, determining a mute point ratio threshold according to the zero-dimensional cepstrum C0 and a first calculation formula;
in step S1312, the first calculation formula may include:
R=R2/R1
R1=a*C0;
where R is the mute point ratio threshold, R1 is the first threshold adjustment parameter, R2 is the second threshold adjustment parameter, and a is the first threshold adjustment constant.
In some embodiments, the first threshold adjustment constant a and the second threshold adjustment parameter R2 can be obtained through repeated statistical tests or taken as empirical values, and the value of a lies between 0 and 1.
When the mute point ratio is greater than the mute point ratio threshold determined by the first calculation formula, the frame to be detected is judged to be a tail point frame.
It can be understood that, assuming that the total number of the mute frames is n and the preset length time window is T, the calculation formula of the mute point ratio r in the preset length time window at this time is as follows:
r=n/T;
when r is greater than R, the current frame to be detected is judged to be a tail point frame; if r is less than R, the current frame to be detected is judged not to be a tail point frame.
If the frame to be detected is a tail point frame, the voice stream data between the voice stream start frame and the tail point frame is intercepted, decoded and recognized to finally obtain the response information; if the frame to be detected is not a tail point frame, detection proceeds to the frame after the current one and tail point frame detection continues.
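A sketch of this branch of the decision, with the first calculation formula inlined; the constants a and R2 below are stand-ins for the statistically obtained values mentioned above:

```python
def tail_point_by_c0(r, c0, a=0.5, r2=0.4):
    """First calculation formula: R = R2 / R1, with R1 = a * C0.

    r -- mute point ratio n/T within the preset-length time window
    Returns True when r > R, i.e. the frame to be detected is judged
    to be a tail point frame.
    """
    R = r2 / (a * c0)
    return r > R
```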
It can be understood that, the determining of the mute frame in step S1200 described above uses a mute frame determination method, which mainly includes:
step S1210, extracting the multi-threshold zero-crossing rate of a frame of audio data and summing its weighted values to obtain the total zero-crossing rate Z. Thresholds T1, T2, T3 of different heights, with T1 < T2 < T3, are set for multi-threshold zero-crossing rate detection, and for each frame the three threshold zero-crossing rates Z1, Z2, Z3 corresponding to T1, T2, T3 are calculated by the following formula:
Zi = ∑{ |sgn[x(n) - Ti] - sgn[x(n-1) - Ti]| + |sgn[x(n) + Ti] - sgn[x(n-1) + Ti]| } * w(n), i = 1, 2, 3;
The total zero-crossing rate Z is represented by the following formula:
Z=W1*Z1+W2*Z2+W3*Z3
where W1, W2, W3 are the zero-crossing rate weights, w(n) is a weighting window, and Z0 is the total zero-crossing rate cutoff value.
Step S1220, using multi-threshold zero crossing rate to weight and pre-judge the mute, if the total zero crossing rate Z of a frame of audio data is less than the set threshold Z0Judging the sound is mute;
step S1230, if the frame is not silent, extracting the composite feature of the frame of audio data;
wherein the composite features may include zero crossing rate, short-time energy value, mel-scale cepstral coefficients based on variable resolution spectrum;
step S1240, classifying the composite features of the audio with a two-class support vector machine, yielding the two classes of normal voice and silence, as sketched in part below.
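A sketch of the pre-judgement in steps S1210-S1220, under assumed thresholds, weights, and cutoff; the two-class support vector machine of steps S1230-S1240 is omitted here:

```python
import numpy as np

def total_zero_crossing_rate(x, thresholds=(0.01, 0.02, 0.04),
                             weights=(0.5, 0.3, 0.2)):
    """Weighted total zero-crossing rate Z = W1*Z1 + W2*Z2 + W3*Z3.

    For each threshold Ti, crossings of both +Ti and -Ti are counted,
    as in the Zi formula above (the weighting window is taken as 1).
    """
    z = 0.0
    for ti, wi in zip(thresholds, weights):
        zi = (np.abs(np.diff(np.sign(x - ti))).sum() +
              np.abs(np.diff(np.sign(x + ti))).sum()) / 2
        z += wi * zi
    return z

# Step S1220: pre-judge a frame as mute when Z < Z0 (illustrative cutoff).
Z0 = 5.0
frame = 0.001 * np.random.randn(400)          # a near-silent frame
print(total_zero_crossing_rate(frame) < Z0)   # True: judged mute
```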
It can be understood that only when the accuracy of mute frame identification is improved can the mute point ratio r within the preset-length time window be calculated accurately; this provides the numerical basis for the accurate voice tail point detection of this embodiment and finally makes that detection more accurate.
According to this embodiment, the mute point ratio threshold is obtained from the zero-dimensional cepstrum C0 of the cepstral features through the corresponding calculation formula and compared with the mute point ratio within the preset-length time window, so that whether the frame to be detected is a tail point frame is finally judged. Detection and recognition of the voice tail point frame thus become more accurate, and collecting real-time frames from the voice stream provides real-time, automatically adjusted detection.
In some embodiments, the cepstral features may include the peak position t0 of the frame to be detected, and the mute point ratio threshold is calculated from the ratio of the second threshold adjustment parameter to the first threshold adjustment parameter. The second threshold adjustment parameter is positively correlated with the peak position t0: the larger t0 is, the larger the second threshold adjustment parameter; conversely, the smaller t0 is, the smaller the second threshold adjustment parameter.
The step S1300 may include:
step S1320, determining the mute point ratio threshold according to the peak position t0 of the frame to be detected.
More specifically, referring to fig. 4, step S1320 may include the steps of:
step S1321, determining a mute point ratio threshold according to the peak position t0 of the frame to be detected and a second calculation formula;
in step S1322, the second calculation formula may include:
R=R2/R1
R2=b*t0;
where R is the mute point ratio threshold, R1 is the first threshold adjustment parameter, R2 is the second threshold adjustment parameter, b is the second threshold adjustment constant, and t0 is the peak position. The second threshold adjustment constant b and the parameter R1 can be obtained through statistical tests or taken as empirical values, and the value of b lies between 0 and 1.
Specifically, the peak position t0 of the cepstrum is the peak position of the pitch period of the frame to be detected and is used to adjust the second threshold adjustment parameter corresponding to t0; the later the peak position, the larger t0, so the peak position is also an important factor influencing voice tail point detection.
Assuming that the total number of the mute frames is n and the preset length time window is T, the calculation formula of the ratio r of the mute points in the preset length time window at this time is as follows:
r=n/T;
when r is greater than R, the current frame to be detected is judged to be a tail point frame; when r is less than R, the current frame to be detected is judged not to be a tail point frame.
If the frame to be detected is a tail point frame, the voice data between the voice stream start frame and the tail point frame is intercepted, decoded and recognized to finally obtain the response information; if the frame to be detected is not a tail point frame, detection proceeds to the frame after the current one and tail point frame detection continues.
According to this embodiment, the mute point ratio threshold is obtained from the peak position t0 of the cepstral features through the corresponding calculation formula and compared with the mute point ratio within the preset-length time window, so that whether the frame to be detected is a tail point frame is finally judged; this makes voice tail point frame detection more accurate while the real-time frames collected from the voice stream keep the detection real-time and automatically adjusted.
In some embodiments, the cepstral features of the frame to be detected may include both the zero-dimensional cepstrum C0 and the peak position t0 of the frame to be detected. The mute point ratio threshold is calculated from the ratio of the second threshold adjustment parameter to the first threshold adjustment parameter, where the first threshold adjustment parameter is positively correlated with the zero-dimensional cepstrum C0 and the second threshold adjustment parameter is positively correlated with the peak position t0.
The step S1300 may include:
and S1330, determining a mute point ratio threshold according to the zero-dimensional cepstrum C0 of the frame to be detected and the peak position t0 of the frame to be detected.
More specifically, referring to fig. 5, step S1330 may include the following sub-steps:
step S1331, determining a mute point ratio threshold according to the zero-dimensional cepstrum C0, the peak position t0 and a third calculation formula;
in step S1332, the third calculation formula may include:
R=R2/R1
R1=a*C0;
R2=b*t0;
where R is the mute point ratio threshold, R1 is the first threshold adjustment parameter, R2 is the second threshold adjustment parameter, and a and b are threshold adjustment constants.
Wherein, the value ranges of a and b are both 0 to 1.
It can be understood that C0 of the cepstral features of the frame to be detected affects the denominator of the third calculation formula, while the peak position t0 of the cepstral features affects the numerator. Both the numerator and the denominator are variables and are adjusted simultaneously: C0 adjusts the first threshold adjustment parameter and the peak position t0 adjusts the second. The final result therefore gains real-time performance and reliability, and the robustness of voice tail point detection can be greatly improved.
It can be understood that the slower the user speaks, the smaller the corresponding zero-dimensional cepstrum C0 and the larger the peak position t0 of the cepstral pitch period. The denominator R1 therefore becomes smaller while the numerator R2 becomes larger, so the mute point ratio threshold grows; a slower speaker thus requires a larger proportion of mute points before a tail point is declared, which finally makes voice tail point detection more accurate.
Assuming that the total number of the mute frames is n and the preset length time window is T, the calculation formula of the ratio r of the mute points in the preset length time window at this time is as follows:
r=n/T;
when r is greater than R, the current frame to be detected is judged to be a tail point frame; when r is less than R, the current frame to be detected is judged not to be a tail point frame.
If the frame to be detected is a tail point frame, the voice stream data between the voice stream start frame and the tail point frame is intercepted, decoded and recognized to finally obtain the response information; if the frame to be detected is not a tail point frame, detection proceeds to the frame after the current one and tail point frame detection continues. By adjusting both the denominator and the numerator of the ratio, this embodiment reduces the sensitivity to any single parameter and increases the robustness of the system.
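The adaptive effect of the third calculation formula can be seen numerically; the constants and the C0 and t0 values below are purely illustrative:

```python
def mute_ratio_threshold(c0, t0, a=0.5, b=0.1):
    # Third calculation formula: R = R2 / R1 = (b * t0) / (a * C0)
    return (b * t0) / (a * c0)

# Slower speech: smaller C0 and a later pitch peak t0 give a larger R,
# so more of the window must be mute before a tail point is declared.
R_fast = mute_ratio_threshold(c0=10.0, t0=30)   # fast speaker (illustrative)
R_slow = mute_ratio_threshold(c0=9.0, t0=36)    # slow speaker (illustrative)
assert R_slow > R_fast                          # 0.8 > 0.6
```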
In some embodiments, the preset length time window is a time window of 40-80 frame lengths, wherein the frame lengths are 20-30 ms.
In some embodiments, in the ideal case where the speech frames do not overlap, the time length of the preset-length time window is the product of the frame length and the number of frames in the window:

  • 25 ms frames: 40 × 25 ms = 1 s; 80 × 25 ms = 2 s
  • 20 ms frames: 40 × 20 ms = 0.8 s; 80 × 20 ms = 1.6 s
  • 30 ms frames: 40 × 30 ms = 1.2 s; 80 × 30 ms = 2.4 s
It can be understood that covering preset window durations from 0.8 s to 2.4 s accommodates more application scenarios, which conveniently broadens the applicable range of this embodiment.
It can be understood that the preset-length time window may be set from empirical or experimental values. If the window is too short, tail point detection on the voice stream of a slow speaker will misjudge non-tail-point frames as tail point frames, because the mute point ratio within the window is higher (for example, a mid-stream pause is falsely detected as the tail point), which reduces the accuracy of voice tail point frame detection. If the window is too long, the real tail point frame of a fast speaker may be missed, with many frames lying between the real tail point frame and the frame to be detected at the preset window position; speech recognition then suffers long delays, which not only occupies too many system resources but also reduces the accuracy and real-time performance of voice tail point detection and harms the user experience.
The embodiment of the application comprehensively considers the factors, and sets the preset length time window to be 40-80 frames according to the general situation of the voice stream generated by the user speaking, thereby effectively improving the accuracy and the real-time performance of voice tail point detection.
In some embodiments, step S1400 may include:
step S1410: when the mute point proportion is larger than the mute point proportion threshold value, determining the frame to be detected as a tail point frame;
step S1420: otherwise, continuing to acquire the next speech frame after the frame to be detected as the new frame to be detected, until the mute point ratio is greater than the mute point ratio threshold and the frame to be detected is determined to be the tail point frame.
It can be understood that the purpose of the embodiments of the present application is to finally determine the voice tail point. The method for determining that the frame to be detected is a tail point frame in step S1410 has been described in the foregoing embodiments and is not repeated here; likewise, the start frame in this embodiment is the start frame S0 of the above embodiments, whose determination is also not repeated. In step S1420, the speech frame S1 that follows the start frame S0 by a preset-length time window T is examined first: if its mute point ratio is less than the mute point ratio threshold, S1 is judged to be a non-tail-point frame and the same tail point frame detection method is applied to the next frame to be detected, S2. If S2 is also judged to be a non-tail-point frame, steps S1100-S1400 continue with the frame after S2, and so on: after the frame to be detected S1, the successive detection points S2, S3, S4, ... correspond to the frames T+1, T+2, T+3, ... after the start frame S0, until the mute point ratio is greater than the mute point ratio threshold and the detected speech frame SN (N = 2, 3, ...) is the tail point frame, at which point detection ends. A sketch of this loop follows.
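Putting the pieces together, the frame-by-frame scan of step S1420 can be written as a single loop. The per-frame dictionary fields (is_mute, c0, t0) and the constants are assumptions of this sketch, not data structures defined by the application:

```python
def detect_tail_point(frames, start, T=70, a=0.5, b=0.1):
    """Scan the frames after the start frame S0 for the tail point frame.

    frames -- list of dicts with 'is_mute', 'c0' and 't0' per speech frame
    Returns the index of the tail point frame SN, or None if the
    stream ends before one is found.
    """
    for i in range(start + T, len(frames)):        # S1, S2, S3, ...
        window = frames[i - T:i]                   # preset-length time window
        r = sum(f["is_mute"] for f in window) / T  # mute point ratio
        R = (b * frames[i]["t0"]) / (a * frames[i]["c0"])
        if r > R:
            return i                               # tail point frame found
    return None                                    # keep waiting for audio
```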
According to the embodiments of the present application, the mute point ratio within the preset-length time window, together with the peak position t0 and the zero-dimensional cepstrum C0 of the cepstral features of the frame to be detected, dynamically adjusts the mute point ratio threshold, and whether the frame to be detected is a tail point frame is determined from the mute point ratio and this threshold. This overcomes the drawbacks that related voice signal processing methods are easily affected by noise and that the average speech rate is hard to obtain, greatly improves the robustness of voice tail point detection in speech recognition, guarantees the accuracy and real-time performance of the detection, and improves the user experience.
In some embodiments, referring to fig. 6, the speech processing method may further include:
step S1900, acquiring voice stream data before a tail point frame;
step S2000, recognizing the voice stream data and outputting the response information.
The purpose of speech recognition is to allow a machine to understand what a person says, understand the person's intent, and react accordingly.
In step S1900, the voice stream data before the tail point frame is acquired.
In some embodiments, after the tail point frame of the speech is determined, the voice stream between the start frame and the tail point frame needs to be obtained in order to get the voice stream segment from the beginning of the utterance to its end. For example, assume the start frame of the voice stream is S0 and the tail point frame is determined to be S1; the voice stream between the start frame S0 and the tail point frame S1 is then intercepted. If S1 is judged to be a non-tail-point frame and the frame after S1, S2, is a tail point frame, the voice stream between the start frame S0 and the tail point frame S2 is intercepted instead; by analogy, the voice stream data between the voice stream start frame and the voice stream tail point frame is acquired.
Step S2000, recognizing the voice stream data and outputting the response information.
In some embodiments, the response information may be text, audio, video, or another type of response. For example, when the voice tail point detection and speech recognition methods of this embodiment are applied to voice wake-up, the voice stream between the voice stream start frame and the tail point frame is intercepted once the tail point frame has been determined by the voice signal processing method, and the voice instruction corresponding to the voice stream recognized by the speech recognition method is then executed; for instance, if the user wants to wake up the terminal device interface by voice, saying "please light up the screen" to the terminal device can automatically light up its interface. As another example, when the methods are applied to speech translation with feedback in audio form, the voice stream between the start frame and the tail point frame is intercepted after the tail point frame is determined, the stream is recognized, and the corresponding audio information is played on the translation interface. As yet another example, the methods can be applied to the driving assistant of a vehicle-mounted device: the driving assistant is an application installed on the vehicle-mounted device that can locate the vehicle in real time, start route navigation, report nearby road conditions in real time, and intelligently control in-car music and calls, helping the user operate it by voice. When the user needs the driving assistant, voice information such as "please go to a certain place" is input through the microphone of the vehicle-mounted device; the driving assistant receives the voice signal, performs voice tail point detection and speech recognition, and gives response information so the driver can notice whether the correct voice signal was received. The driving assistant often responds in audio or video form, for example playing the audio "the route to the destination has been planned, please confirm", or displaying the destination route on the display screen of the vehicle-mounted device after the recognized speech is translated. The application is not limited to these scenarios; it can also be applied to other voice-input scenarios such as voice assistants and smart speakers.
It can be understood that the speech recognition is to decode the cepstrum feature sequence corresponding to the intercepted effective frame sequence into the response information of the speech stream by using the acoustic and language models.
The speech recognition process is mainly divided into three parts:
the first part, extracting characteristic parameters, preprocessing the voice signal, extracting voice characteristic parameters to represent the voice signal;
a second part, training a model, and training an acoustic model and a language model by using the characteristic parameters;
and the third part is pattern matching, namely matching the characteristic parameters of the voice signal to be recognized with the trained model to generate a recognition effect.
The acoustic model is the bottom model of the recognition system, and is the most critical part of the speech recognition system. The objective of the acoustic model is to calculate the distance between the sequence of speech feature vectors and each pronunciation template. The acoustic model is designed to find the smallest recognition unit, which is closely related to the pronunciation characteristics of the language. The size of the recognition unit has a large influence on the size of the voice data amount, the recognition rate, and the flexibility. Wherein the recognition unit may be a word, a demi-syllable, or a phoneme.
The language model refers to some rules or grammatical structures in the language, and may also be a statistical model representing word or word context. Due to the complexity of the voice signals, the phenomenon of overlapping connection exists between different pronunciations, and even people can not distinguish the pronunciations if the front and the back of the single tone are not connected, so that the distinguishing degree of the acoustic model can be improved by means of the language model. The embodiment utilizes a relatively mature model, namely a statistical language model, which extracts the statistical relationship between different characters and words through statistics of a large number of text files.
It is understood that acquiring the voice stream data before the tail point frame means acquiring the voice stream from the start frame to the tail point frame. For example, mark the start frame of the voice stream as S0 and the preset-length time window as T; the speech frame S1, delayed T frames after S0, is taken as the frame to be detected. If the method of the first aspect detects that this speech frame is a tail point frame, the voice stream intercepted in this embodiment runs from S0 to S1; if it is detected not to be a tail point frame, the next candidate S2 is checked, and if S2 is a tail point frame the intercepted voice stream runs from S0 to S2; if not, the next candidate S3 is checked, and detection keeps advancing one frame at a time until the frame to be detected is a tail point frame, whereupon the voice stream between the start frame and the tail point frame is intercepted and passed to speech recognition processing.
This embodiment intercepts the voice stream between the start frame and the tail point frame once the tail point frame has been determined, recognizes the voice stream in real time, and outputs the corresponding response information.
In a second aspect, an embodiment of the present application provides an electronic device.
In some embodiments, referring to fig. 7, the electronic device may include one or more processors 110; a storage device 120 for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform: the speech signal processing method according to the first aspect.
In some embodiments, the electronic device may be a mobile terminal device or a non-mobile terminal device. The mobile terminal equipment can be a mobile phone, a tablet computer, a notebook computer, a palm computer, vehicle-mounted terminal equipment, wearable equipment, a super mobile personal computer, a netbook, a personal digital assistant and the like; the non-mobile terminal equipment can be a personal computer, a television, a teller machine or a self-service machine and the like; the embodiments of the present invention are not particularly limited.
For example, when the electronic device in this embodiment is a mobile terminal device, the user uses a microphone or similar device on the mobile terminal to acquire the speaker's voice signal; tail point detection and speech recognition are performed with the speech processing method of the first aspect, and the mobile terminal device provides response information, which may be in text, audio, or video form. For example, the voice signal processing method of the first aspect can perform speech recognition in the driving assistant of a vehicle-mounted device: the driving assistant is an application installed on the vehicle-mounted device that can locate the vehicle in real time, start route navigation, report nearby road conditions in real time, and intelligently control in-car music and calls, helping the user operate it by voice. When the user needs the driving assistant, voice information such as "please go to a certain place" is input through the microphone on the vehicle-mounted device; the driving assistant obtains the voice signal, performs voice tail point detection and speech recognition, and gives response information. To help the driver notice whether the correct voice signal was received, the driving assistant often responds in audio or video form, for example playing the audio "the route to the destination has been planned, please confirm", or displaying the destination route on the display screen of the vehicle-mounted device after the recognized speech is translated. The application is not limited to these scenarios and can also be applied to other voice-input scenarios such as voice assistants and smart speakers. It should be noted that the above application scenarios impose no requirement on the network and can be implemented offline or over a wired or wireless network.
In some embodiments, the electronic device performs the speech signal processing method of steps S1100 to S2000 as in the first aspect embodiment described above.
According to the embodiments of the present application, the mute point ratio within the preset-length time window, together with the peak position t0 and the zero-dimensional cepstrum C0 of the cepstral features of the frame to be detected, dynamically adjusts the mute point ratio threshold, and whether the frame to be detected is a tail point frame is determined from the mute point ratio and this threshold. This overcomes the drawbacks that related voice signal processing methods are easily affected by noise and that the average speech rate is hard to obtain, greatly improves the robustness of voice tail point detection in speech recognition, guarantees the accuracy and real-time performance of the detection, and improves the user experience.
In a third aspect, embodiments of the present application provide a computer-readable storage medium.
In some embodiments, the computer-readable storage medium stores computer-executable instructions for performing: a speech signal processing method as in the first aspect.
In some embodiments, a computer-readable storage medium stores computer-executable instructions for performing the speech signal processing method in steps S1100 to S2000 as in the first aspect embodiment described above.
According to the embodiments of the present application, the mute point ratio within the preset-length time window, together with the peak position t0 and the zero-dimensional cepstrum C0 of the cepstral features of the frame to be detected, dynamically adjusts the mute point ratio threshold, and whether the frame to be detected is a tail point frame is determined from the mute point ratio and this threshold. This overcomes the drawbacks that related voice signal processing methods are easily affected by noise and that the average speech rate is hard to obtain, greatly improves the robustness of voice tail point detection in speech recognition, guarantees the accuracy and real-time performance of the detection, and improves the user experience.
In a fourth aspect, an embodiment of the present application provides a speech processing apparatus.
In some embodiments, referring to fig. 8, the speech processing apparatus may include:
the audio extraction module 210 is configured to obtain audio features of a frame to be detected in a speech signal;
the audio processing module 220 is connected to the audio extraction module 210 and is configured to obtain the mute point ratio within a time window of preset length before the frame to be detected, determine the mute point ratio threshold according to the audio features of the frame to be detected, and determine whether the frame to be detected is a tail point frame according to the mute point ratio and the mute point ratio threshold.
In some embodiments, the audio processing module 220 extracts cepstral features from the audio features obtained by the audio extraction module 210; the pitch period peak position t0 and the zero-dimensional cepstrum C0 of each speech frame in the extracted cepstral features are the key parameters for tail point detection. When tail point detection is performed, each frame to be detected is examined against a time window of preset length extending backward from it; that is, the mute point ratio within the time window of preset length before the frame to be detected is obtained. The mute point ratio threshold is then dynamically adjusted according to the pitch period peak position t0 and the zero-dimensional cepstrum C0 in the cepstral features of the current frame to be detected, and finally whether the frame to be detected is the tail point frame is judged from the mute point ratio and the mute point ratio threshold, as sketched in the code below.
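The following Python code is a minimal sketch of one possible reading of this detection step. The patent specifies only that the first threshold adjustment parameter is positively correlated with C0 and the second with t0 (see claims 4 to 6), not their exact form, so the linear adjustment parameters, the base threshold and clamp, the energy-based mute point test, and the pitch lag search range assumed here are all illustrative, not values from the patent.

    import numpy as np

    def cepstrum(frame: np.ndarray) -> np.ndarray:
        # Real cepstrum of one speech frame: IFFT of the log magnitude spectrum.
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-12   # small offset avoids log(0)
        return np.fft.irfft(np.log(spectrum))

    def is_silence(frame: np.ndarray, energy_floor: float = 1e-4) -> bool:
        # Illustrative mute point test: mean frame energy below an assumed floor.
        return float(np.mean(frame ** 2)) < energy_floor

    def dynamic_threshold(c0: float, t0: int, base: float = 0.8,
                          alpha: float = 0.05, beta: float = 0.002) -> float:
        # Ratio of two adjustment parameters, as in claims 4-6: the first is
        # positively correlated with C0, the second with t0. The linear forms
        # and all constants here are assumptions for illustration only.
        first = 1.0 + alpha * c0
        second = 1.0 + beta * t0
        return min(max(base * first / second, 0.5), 0.95)   # clamp to a sane range

    def is_tail_frame(mute_history: list, frame: np.ndarray,
                      window: int = 60) -> bool:
        # mute_history holds booleans (mute or not) for the preceding frames;
        # window = 60 frames sits inside the 40-80 frame range of claim 7.
        recent = mute_history[-window:]
        ratio = sum(recent) / max(len(recent), 1)
        ceps = cepstrum(frame)
        c0 = float(ceps[0])                        # zero-dimensional cepstrum C0
        t0 = int(np.argmax(ceps[32:320])) + 32     # pitch period peak position
        return ratio > dynamic_threshold(c0, t0)

With 25 ms frames at 16 kHz (400 samples), the assumed lag range of 32 to 320 samples corresponds roughly to pitch frequencies from 500 Hz down to 50 Hz.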
In some embodiments, the speech processing apparatus may further include:
the voice signal capture module, connected to the audio extraction module, is configured to receive an analog voice signal, convert it into a digital voice signal, and transmit the digital voice signal to the audio extraction module.
Specifically, the voice signal capture module is the component through which the electronic device converts an analog voice signal into a digital voice signal, for example via a microphone. The sampling rate is typically 16 kHz and the sample depth 8 bits. An audio segment of 25 ms that overlaps its neighbors by 10 ms is commonly used as a voice frame, the minimum unit for feature extraction, and each voice frame may comprise a number of sampling points; a framing sketch follows.
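As a minimal framing sketch, the 10 ms figure in the passage is read here with the common convention that it denotes the frame shift, so consecutive 25 ms frames overlap by 15 ms; that reading is an assumption, as are the helper name and defaults.

    import numpy as np

    def frame_signal(samples: np.ndarray, sample_rate: int = 16000,
                     frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
        # Split a digital speech signal into overlapping voice frames.
        frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
        hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
        n_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
        return np.stack([samples[i * hop_len : i * hop_len + frame_len]
                         for i in range(n_frames)])

For instance, one second of audio at 16 kHz (frame_signal(np.zeros(16000))) yields 98 frames of 400 samples each.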
In some embodiments, referring to fig. 9, the speech processing apparatus may further include:
the obtaining module 230 is connected to the audio processing module 220, and is configured to obtain voice stream data before the end point frame.
In some embodiments, when the audio processing module 220 detects that the mute point ratio of the frame to be detected exceeds the mute point ratio threshold, the frame to be detected is determined to be the tail point frame of the voice stream; at this point the obtaining module 230 intercepts the voice stream data between the start frame and the tail point frame of the voice stream, as sketched below.
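A minimal sketch of this interception step, under the same assumptions as the earlier code: it reuses the illustrative is_silence and is_tail_frame helpers, and start_idx (the detected start frame) is assumed to come from a separate head point detector that this embodiment does not describe.

    def intercept_utterance(frames, start_idx: int):
        # Buffer frames from the detected start frame onward and cut the
        # voice stream once a frame is judged to be the tail point frame,
        # mirroring the hand-off from the audio processing module 220 to
        # the obtaining module 230.
        buffered, mute_history = [], []
        for idx, frame in enumerate(frames):
            if idx < start_idx:
                continue
            buffered.append(frame)
            mute_history.append(is_silence(frame))
            if is_tail_frame(mute_history, frame):
                return buffered      # start frame through tail point frame
        return buffered              # stream ended before a tail was found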
The recognition module 240 is connected to the obtaining module 230 and is configured to recognize the voice stream data transmitted by the obtaining module 230 and output response information.
In some embodiments, the recognition module 240 may include an acoustic model and a language model, and may decode the acquired voice stream data into corresponding response information.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to one of ordinary skill in the art.
While the preferred embodiments of the present invention have been described, the present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions are included in the scope of the present invention defined by the claims.

Claims (11)

1. A speech signal processing method, comprising:
acquiring audio features of a frame to be detected in a voice signal;
obtaining a mute point ratio within a time window of preset length before the frame to be detected;
determining a mute point ratio threshold according to the audio features;
and determining a tail point frame in the voice signal according to the mute point ratio and the mute point ratio threshold.
2. The method of claim 1, wherein the audio features are cepstral features.
3. The method of claim 2, wherein the cepstral features comprise one or more of:
the zero-dimensional cepstrum C0 of the frame to be detected and the peak position t0 of the frame to be detected.
4. The method according to claim 3, wherein, when the cepstral features comprise the zero-dimensional cepstrum C0 of the frame to be detected,
the determining a mute point ratio threshold according to the audio features comprises:
calculating the mute point ratio threshold according to the ratio of a first threshold adjustment parameter to a second threshold adjustment parameter, wherein the first threshold adjustment parameter is positively correlated with the zero-dimensional cepstrum C0.
5. The method according to claim 3, wherein, when the cepstral features comprise the peak position t0 of the frame to be detected,
the determining a mute point ratio threshold according to the audio features comprises:
calculating the mute point ratio threshold according to the ratio of a first threshold adjustment parameter to a second threshold adjustment parameter,
wherein the second threshold adjustment parameter is positively correlated with the peak position t0.
6. The method according to claim 3, wherein, when the cepstral features comprise the zero-dimensional cepstrum C0 of the frame to be detected and the peak position t0 of the frame to be detected,
the determining a mute point ratio threshold according to the audio features comprises:
calculating the mute point ratio threshold according to the ratio of a first threshold adjustment parameter to a second threshold adjustment parameter, wherein the first threshold adjustment parameter is positively correlated with the zero-dimensional cepstrum C0 and the second threshold adjustment parameter is positively correlated with the peak position t0.
7. The method according to claim 1, wherein the time window of preset length is a time window of 40 to 80 frame lengths, each frame length being 20 to 30 ms.
8. The method according to any one of claims 1 to 7, wherein determining the tail point frame in the speech signal according to the mute point ratio and the mute point ratio threshold comprises:
when the mute point ratio is greater than the mute point ratio threshold, determining that the frame to be detected is the tail point frame;
otherwise, continuing to acquire the next voice frame after the frame to be detected as the frame to be detected and repeating the detection, until the mute point ratio is greater than the mute point ratio threshold and the frame to be detected is determined to be the tail point frame.
9. The method of any one of claims 1 to 7, further comprising:
acquiring voice stream data before the tail point frame;
and recognizing the voice stream data and outputting response information.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform:
the speech signal processing method according to any one of claims 1 to 7.
11. A computer-readable storage medium storing computer-executable instructions for performing:
the speech signal processing method according to any one of claims 1 to 7.
CN202010581908.1A 2020-06-23 2020-06-23 Voice signal processing method, apparatus and storage medium Pending CN111768800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010581908.1A CN111768800A (en) 2020-06-23 2020-06-23 Voice signal processing method, apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010581908.1A CN111768800A (en) 2020-06-23 2020-06-23 Voice signal processing method, apparatus and storage medium

Publications (1)

Publication Number Publication Date
CN111768800A true CN111768800A (en) 2020-10-13

Family

ID=72722110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010581908.1A Pending CN111768800A (en) 2020-06-23 2020-06-23 Voice signal processing method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN111768800A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000276200A (en) * 1999-03-26 2000-10-06 Matsushita Electric Works Ltd Voice quality converting system
CN1758331A (en) * 2005-10-31 2006-04-12 浙江大学 Quick audio-frequency separating method based on tonic frequency
JP2008176155A (en) * 2007-01-19 2008-07-31 Kddi Corp Voice recognition device and its utterance determination method, and utterance determination program and its storage medium
US20090177466A1 (en) * 2007-12-20 2009-07-09 Kabushiki Kaisha Toshiba Detection of speech spectral peaks and speech recognition method and system
US20090222258A1 (en) * 2008-02-29 2009-09-03 Takashi Fukuda Voice activity detection system, method, and program product
US20130197911A1 (en) * 2010-10-29 2013-08-01 Anhui Ustc Iflytek Co., Ltd. Method and System For Endpoint Automatic Detection of Audio Record
US20130322644A1 (en) * 2012-05-31 2013-12-05 Yamaha Corporation Sound Processing Apparatus
WO2017012242A1 (en) * 2015-07-22 2017-01-26 百度在线网络技术(北京)有限公司 Voice recognition method and apparatus
US20180350388A1 (en) * 2017-05-31 2018-12-06 International Business Machines Corporation Fast playback in media files with reduced impact to speech quality
CN108305639A (en) * 2018-05-11 2018-07-20 南京邮电大学 Speech-emotion recognition method, computer readable storage medium, terminal
CN110349598A (en) * 2019-07-15 2019-10-18 桂林电子科技大学 A kind of end-point detecting method under low signal-to-noise ratio environment
CN111105782A (en) * 2019-11-27 2020-05-05 深圳追一科技有限公司 Session interaction processing method and device, computer equipment and storage medium
CN114155839A (en) * 2021-12-15 2022-03-08 科大讯飞股份有限公司 Voice endpoint detection method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU SIWEI ET AL.: "Research on an adaptive real-time voice activity detection method based on G.729", Computer Engineering and Applications, 31 December 2007 (2007-12-31), pages 57-60 *
DU YUXUAN: "Research on speaker segmentation and clustering in multi-speaker scenarios", Wanfang master's thesis, 16 January 2024 (2024-01-16), page 12 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination