CN112489692A - Voice endpoint detection method and device - Google Patents

Voice endpoint detection method and device

Info

Publication number
CN112489692A
CN112489692A
Authority
CN
China
Prior art keywords
voice
speech
frame
model
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011213344.2A
Other languages
Chinese (zh)
Inventor
刘羽辰
李健
武卫东
陈明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd
Priority to CN202011213344.2A
Publication of CN112489692A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/78 — Detection of presence or absence of voice signals
    • G10L 25/87 — Detection of discrete points within a voice signal
    • G10L 25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/84 — Detection of presence or absence of voice signals for discriminating voice from noise

Abstract

Embodiments of the application relate to a voice endpoint detection method and device. The method comprises: extracting voice features from the speech to be detected to obtain a plurality of feature frames; calculating, for each feature frame, a speech likelihood value against a pre-trained speech model and a non-speech likelihood value against a pre-trained non-speech model, and judging from these values whether each frame is a speech frame or a non-speech frame; adaptively updating the speech model with the speech frames and the non-speech model with the non-speech frames; and determining the endpoints of the speech to be detected with the updated models. After adjustment with only a small amount of data from a specific scene, the voice endpoint detection method can accurately distinguish speech from non-speech in that scene.

Description

Voice endpoint detection method and device
Technical Field
The embodiment of the application relates to the technical field of voice recognition, in particular to a voice endpoint detection method and device.
Background
Endpoint detection, also called voice activity detection (VAD), aims to distinguish speech regions from non-speech regions. In plain terms, endpoint detection locates the start point and the end point of speech (the endpoints) in a noisy recording, removes silence and noise, and isolates the genuinely valid speech content.
Endpoint detection is an important and indispensable link in speech recognition, and its quality directly affects recognition accuracy. A good endpoint detection technique should detect neither too little nor too much. If too little is detected, speech information is lost and recognition is missed; if too much is detected, the head of the speech segment may contain noise, causing misrecognition or repeated recognition and increasing the real-time factor of speech recognition. Endpoint detection is therefore crucial to the overall speech recognition flow.
Existing voice endpoint detection generally either extracts speech features and judges speech versus non-speech from those features, or builds an acoustic model to classify or decode the speech and judges speech versus non-speech from global information. However, both approaches can only discriminate the speech and environmental noise of a single scene and cannot be applied to other scenes; when the scene changes, the model must be retrained to adapt to the changed noise in that scene.
Disclosure of Invention
In view of the above problems, embodiments of the present application provide a voice endpoint detection method and apparatus, which aim to overcome the inaccurate detection results and poor generality of existing voice endpoint detection methods.
A first aspect of an embodiment of the present application provides a method for detecting a voice endpoint, where the method includes:
performing voice feature extraction on voice to be detected to obtain a plurality of feature frames;
calculating a speech likelihood value of each feature frame against a pre-trained speech model, and calculating a non-speech likelihood value of each feature frame against a pre-trained non-speech model;
calculating a signal-to-noise likelihood ratio of each feature frame based on the speech likelihood value and the non-speech likelihood value;
judging, based on the signal-to-noise likelihood ratio, whether the frame of the speech to be detected corresponding to each feature frame is a speech frame or a non-speech frame;
adaptively updating the speech model based on all the speech frames and adaptively updating the non-speech model based on all the non-speech frames;
respectively calculating a final speech likelihood value and a final non-speech likelihood value of each feature frame by using the updated speech model and the updated non-speech model;
and calculating a final signal-to-noise likelihood ratio of each feature frame based on the final speech likelihood value and the final non-speech likelihood value, and detecting the endpoints of the speech to be detected based on the final signal-to-noise likelihood ratio.
Optionally, adaptively updating the speech model based on all the speech frames, and adaptively updating the non-speech model based on all the non-speech frames includes:
stopping the adaptive updating of the speech model and the non-speech model when a condition is met;
the condition includes at least one of the following:
the signal-to-noise likelihood ratios of the feature frames have stabilized;
the update error of the speech model or the non-speech model is smaller than a preset threshold.
Optionally, the speech likelihood value of each feature frame against the pre-trained speech model and the non-speech likelihood value of each feature frame against the pre-trained non-speech model are calculated by any one of the following algorithms:
a maximum a posteriori probability algorithm, a maximum likelihood estimation algorithm, or an expectation-maximization algorithm.
Optionally, the method further includes:
acquiring training data, wherein the training data comprises non-voice data and voice data;
training a first preset Gaussian mixture model by using the non-voice data to obtain the non-voice model;
training a second preset Gaussian mixture model by using the voice data to obtain the voice model;
adaptively updating the speech model based on all the speech frames and adaptively updating the non-speech model based on all the non-speech frames, wherein an algorithm of the adaptive updating comprises any one of the following:
maximum a posteriori probability MAP, maximum likelihood linear regression MLLR.
Optionally, the extracting the voice feature of the voice to be detected includes:
performing pre-emphasis on the speech to be detected to boost its high-frequency part, and obtaining the pre-emphasized speech to be detected;
dividing the pre-emphasized voice to be detected into a plurality of initial voice frames;
multiplying each initial voice frame by a Hamming window to obtain an intermediate processed signal;
performing fast Fourier transform on the intermediate processed signal to obtain the frequency spectrum and energy distribution of the voice frame;
filtering the energy spectrum through a bank of Mel-scale triangular filters to smooth the frequency spectrum of the speech frame;
and calculating the logarithmic energy output by each triangular filter, obtaining Mel-frequency cepstral coefficients (MFCC) by applying a discrete cosine transform (DCT) to the logarithmic energy, and taking the MFCC coefficients as the speech features.
Optionally, the length of a frame in the feature frame, the speech frame, or the non-speech frame is 20 ms.
Optionally, in the multiple initial speech frames, there is no overlap between any two adjacent initial speech frames.
A second aspect of the embodiments of the present application provides a voice endpoint detection apparatus, including:
the feature extraction module is used for extracting voice features of the voice to be detected to obtain a plurality of feature frames;
the speech likelihood value calculation module is used for calculating a speech likelihood value of each feature frame against a pre-trained speech model;
the non-speech likelihood value calculation module is used for calculating a non-speech likelihood value of each feature frame against a pre-trained non-speech model;
a signal-to-noise likelihood ratio calculation module for calculating a signal-to-noise likelihood ratio of each feature frame based on the speech likelihood value and the non-speech likelihood value;
the voice frame judging module is used for judging whether a frame corresponding to each characteristic frame in the voice to be detected is a voice frame or a non-voice frame based on the signal-to-noise likelihood ratio;
the self-adaptive updating module is used for carrying out self-adaptive updating on the voice model based on all the voice frames and carrying out self-adaptive updating on the non-voice model based on all the non-voice frames;
an endpoint detection module, configured to calculate a final signal-to-noise likelihood ratio of each feature frame based on the final speech likelihood value and the final non-speech likelihood value, and to detect the endpoints of the speech to be detected based on the final signal-to-noise likelihood ratio;
wherein the final speech likelihood value and the final non-speech likelihood value of each feature frame are respectively calculated with the updated speech model and the updated non-speech model.
Optionally, the adaptive update module includes:
the condition matching submodule is used for stopping the adaptive updating of the speech model and the non-speech model when a condition is met;
the condition includes at least one of the following:
the signal-to-noise likelihood ratios of the feature frames have stabilized;
the update error of the speech model or the non-speech model is smaller than a preset threshold.
Optionally, in the speech likelihood value calculation module and the non-speech likelihood value calculation module, the speech likelihood value of each feature frame against the pre-trained speech model and the non-speech likelihood value of each feature frame against the pre-trained non-speech model are calculated by any one of the following algorithms:
a maximum a posteriori probability algorithm, a maximum likelihood estimation algorithm, or an expectation-maximization algorithm.
According to the voice endpoint detection method and device of the embodiments, speech features are first extracted from the speech to be detected; speech frames and non-speech frames are then obtained by frame-by-frame judgment with a speech model and a non-speech model; the speech model and the non-speech model are adaptively updated with the speech frames and the non-speech frames, respectively; and the endpoints of the speech to be detected are detected with the updated speech model and non-speech model. This training process allows the initial speech and non-speech models, after adjustment with only a small amount of data from a specific scene, to accurately detect speech and non-speech in that scene, and it facilitates fast training convergence and decoding computation. Meanwhile, because adaptive training is performed for specific scenes, the speech and non-speech models can change with the scene, so the voice endpoint detection method and device can be applied to a greater variety of noise scenes, detect the target speech in various noise scenes, and improve the accuracy of voice endpoint detection in every scene where they are applied.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a flowchart of a voice endpoint detection method according to an embodiment of the present application;
FIG. 2 is a flow chart of adaptive update proposed by an embodiment of the present application;
fig. 3 is a flowchart illustrating speech feature extraction performed on a speech to be detected according to an embodiment of the present application;
fig. 4 is a schematic diagram of a voice endpoint detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, a method of voice endpoint detection in the related art will be described.
Existing voice endpoint detection (VAD) algorithms can be roughly classified into three categories: threshold-based VAD, VAD as a classifier, and model-based VAD. Threshold-based VAD distinguishes speech from non-speech by extracting time-domain features (short-time energy, short-time zero-crossing rate, etc.) or frequency-domain features (MFCC, spectral entropy, etc.) and setting a suitable threshold; this is the traditional VAD method. VAD as a classifier treats speech detection as a binary speech/non-speech classification problem and trains a classifier with machine learning methods. Model-based VAD uses a complete acoustic model (the granularity of the modeling unit can be very coarse) and, on the basis of decoding, distinguishes speech segments from non-speech segments using global information.
However, noise is highly varied, and in many cases ambient speech itself must be treated as noise: when extracting the main segments of a conversation, such as two people talking in a vegetable market, the inactive portions may be filled with vendors' shouting, and speech detection would be affected if other people's shouting were not removed. A fixed VAD classification method can hardly cover all situations, and dedicating a separate VAD method to each environment is obviously redundant and cumbersome. The limitations of conventional VAD methods are apparent in such cases.
Referring to fig. 1, fig. 1 is a flowchart of a voice endpoint detection method according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
step S101, voice feature extraction is carried out on voice to be detected to obtain a plurality of feature frames.
The speech features of the speech to be detected are extracted for subsequent judgment. The speech features may be power-normalized cepstral coefficient (PNCC) features, Mel-frequency cepstral coefficient (MFCC) features, filter-bank (Fbank) features, and the like. Feature extraction proceeds at frame granularity: features are extracted frame by frame, yielding a plurality of feature vectors, one per frame.
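As a concrete illustration only (the application names PNCC, MFCC, and Fbank features but does not prescribe a toolkit), frame-wise MFCC extraction could look as follows; librosa is used here merely as a convenient stand-in, and the file name and parameter values are assumptions:

    # Sketch of frame-wise MFCC extraction; library choice, file name and parameters are assumptions.
    import librosa

    y, sr = librosa.load("speech_to_detect.wav", sr=16000)   # speech to be detected
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=13,                     # number of cepstral coefficients per frame (assumed)
        n_fft=512,
        win_length=int(0.020 * sr),    # 20 ms frames, as in the embodiment
        hop_length=int(0.020 * sr),    # no overlap between adjacent initial frames
        center=False,
    )
    feature_frames = mfcc.T            # shape (num_frames, 13): one feature vector per frame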
Step S102: a speech likelihood value of each feature frame is calculated against a pre-trained speech model, and a non-speech likelihood value of each feature frame is calculated against a pre-trained non-speech model.
Before endpoint detection is performed, a non-speech model is trained with background data and a speech model is trained with speech data. The speech data refers to voice data of the target speaker to be detected; the background data refers to other audio data that does not contain the target speaker's voice, such as other people's speech, car horns, and bird calls. The initial models are probability models of various forms, such as Euler, Laplace, and Gaussian models; different models use corresponding training methods to obtain, from the background data and the speech data, speech and non-speech models usable for endpoint detection. The trained speech model and non-speech model respectively characterize the feature distributions of the speech data and of the background data.
The speech likelihood value between each feature frame and the speech model is calculated frame by frame; the larger the speech likelihood value, the closer the distribution of the feature frame is to the distribution of the speech model.
The non-speech likelihood value between each feature frame and the non-speech model is calculated frame by frame; the larger the non-speech likelihood value, the closer the distribution of the feature frame is to the distribution of the non-speech model.
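For illustration, and assuming the Gaussian mixture models of the optional embodiment below (the application also allows other model families), the pre-training of the two models and the per-frame likelihood calculation could be sketched as follows; function names and hyperparameters are assumptions:

    # Sketch of model pre-training and per-frame likelihood calculation (GMMs assumed).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_models(speech_train_feats: np.ndarray, background_train_feats: np.ndarray):
        """Train the speech / non-speech models on labelled feature matrices
        of shape (num_frames, feature_dim); the component count is an assumed value."""
        speech_gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
        nonspeech_gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
        return speech_gmm.fit(speech_train_feats), nonspeech_gmm.fit(background_train_feats)

    def frame_logliks(feature_frames: np.ndarray, speech_gmm, nonspeech_gmm):
        """Per-frame log-likelihood under each model; a larger value means the frame's
        distribution is closer to that model's distribution."""
        return speech_gmm.score_samples(feature_frames), nonspeech_gmm.score_samples(feature_frames)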
Step S103: the signal-to-noise likelihood ratio of each feature frame is calculated based on the speech likelihood value and the non-speech likelihood value.
Based on the speech and non-speech likelihood values calculated in step S102, the signal-to-noise likelihood ratio of each feature frame can be calculated. This ratio between the signal (speech) likelihood and the noise (non-speech) likelihood indicates whether the frame leans toward signal or toward noise. In one embodiment of the present application, the signal-to-noise likelihood ratio of a feature frame is obtained by directly dividing its speech likelihood value by its non-speech likelihood value.
Step S104: based on the signal-to-noise likelihood ratio, the frame of the speech to be detected corresponding to each feature frame is judged to be a speech frame or a non-speech frame.
Based on the signal-to-noise likelihood ratios obtained in step S103, when the ratio of a feature frame is greater than a preset threshold, the corresponding frame in the original speech to be detected is determined to be a speech frame; when the ratio is less than the preset threshold, the corresponding frame is determined to be a non-speech frame.
In an embodiment of the present application, the threshold is 1: a frame of the speech to be detected whose feature frame has a signal-to-noise likelihood ratio greater than 1 is a speech frame; otherwise it is a non-speech frame.
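A minimal sketch of steps S103 and S104, under the assumption that the likelihoods are kept in the log domain (so the ratio threshold of 1 becomes a log-ratio threshold of 0):

    # Sketch of the signal-to-noise likelihood ratio and the per-frame decision.
    import numpy as np

    def classify_frames(speech_loglik: np.ndarray, nonspeech_loglik: np.ndarray, threshold: float = 1.0):
        """Divide the speech likelihood by the non-speech likelihood (here as a difference
        of log-likelihoods) and compare against the preset threshold (1 in the embodiment)."""
        log_snr = speech_loglik - nonspeech_loglik     # log(speech likelihood / non-speech likelihood)
        is_speech = log_snr > np.log(threshold)        # True: speech frame, False: non-speech frame
        return log_snr, is_speech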
Step S105: the speech model is adaptively updated based on all the speech frames, and the non-speech model is adaptively updated based on all the non-speech frames.
As shown in fig. 2, the adaptive update comprises the following steps. S10501: all speech frames are spliced to obtain speech segments, and all non-speech frames are spliced to obtain non-speech segments; because each segment is assembled only from speech or only from non-speech, each segment can be regarded as homogeneous. In one embodiment of the application, the speech frames are spliced into a plurality of speech segments 3-5 s in length, and similarly the non-speech frames are spliced into a plurality of non-speech segments 3-5 s in length.
The speech model is adaptively updated with the spliced speech segments, and the non-speech model is adaptively updated with the spliced non-speech segments.
S10502: the spliced speech segments are used to update the speech model for the first time, and the spliced non-speech segments are used to update the non-speech model for the first time.
It should be noted that, because the speech and non-speech models of the present application may be trained from different initial models, the specific algorithms and procedures for implementing the update will also differ depending on which initial models were selected.
S10503: the speech likelihood value of each feature frame is calculated with the speech model after the first update, and the non-speech likelihood value of each feature frame is calculated with the non-speech model after the first update.
S10504: whether the stopping condition is met is judged; if not, the updating steps continue to be executed; if so, the update is exited.
In one embodiment of the present application, the updating steps that continue to be executed include:
S10505: a new round of signal-to-noise likelihood ratios is calculated for each feature frame based on the speech likelihood value and the non-speech likelihood value.
S10506: based on the new round of signal-to-noise likelihood ratios, the frame of the speech to be detected corresponding to each feature frame is judged to be a speech frame or a non-speech frame, and the process jumps back to step S10501 after the judgment is finished. A sketch of this iterative adaptation loop is given below.
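The following is a sketch of the iterative adaptation loop (steps S10501-S10506). For simplicity it operates directly on the classified frames rather than on spliced 3-5 s segments, uses warm-started re-fitting of the assumed GMMs as a stand-in for the MAP/MLLR adaptation named by the application, and uses stabilization of the frame decisions as the stopping test:

    # Sketch of the adaptive-update loop; segment splicing is omitted and warm-started
    # re-fitting stands in for the MAP/MLLR adaptation named in the application.
    import numpy as np

    def adaptive_update(feature_frames, speech_gmm, nonspeech_gmm, max_rounds=10):
        prev = None
        for _ in range(max_rounds):
            log_snr = speech_gmm.score_samples(feature_frames) - nonspeech_gmm.score_samples(feature_frames)
            is_speech = log_snr > 0.0                    # ratio threshold of 1 in the log domain
            if prev is not None and np.array_equal(is_speech, prev):
                break                                    # decisions have stabilized: stop updating
            prev = is_speech
            if is_speech.any():                          # adapt the speech model on the speech frames
                speech_gmm.warm_start = True
                speech_gmm.fit(feature_frames[is_speech])
            if (~is_speech).any():                       # adapt the non-speech model on the rest
                nonspeech_gmm.warm_start = True
                nonspeech_gmm.fit(feature_frames[~is_speech])
        return speech_gmm, nonspeech_gmm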
After the speech model and the non-speech model have been adaptively updated in step S105, the method proceeds to step S106.
Step S106: the final speech likelihood value and the final non-speech likelihood value of each feature frame are respectively calculated with the updated speech model and the updated non-speech model.
That is, the final speech likelihood value of each feature frame is calculated with the adaptively updated speech model, and the final non-speech likelihood value is calculated with the adaptively updated non-speech model.
Step S107: a final signal-to-noise likelihood ratio of each feature frame is calculated based on the final speech likelihood value and the final non-speech likelihood value, and the endpoints of the speech to be detected are detected based on the final signal-to-noise likelihood ratio.
The signal-to-noise likelihood ratio of each feature frame is calculated from the final speech and non-speech likelihood values obtained in step S106, and the frame of the speech to be detected corresponding to each feature frame is finally judged to be a speech frame or a non-speech frame based on this ratio.
The endpoints of the speech to be detected are then detected from the distribution of the speech frames and non-speech frames obtained in this last judgment, as sketched below.
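As a sketch of this last step, the final per-frame decisions can be turned into endpoint pairs (start time, end time), assuming the 20 ms non-overlapping frames of the embodiment:

    # Sketch: derive endpoints from the final speech / non-speech frame decisions.
    def frame_decisions_to_endpoints(is_speech, frame_len_s=0.020):
        """Return a list of (start_time, end_time) pairs in seconds, one per speech region."""
        endpoints, start = [], None
        for i, speech in enumerate(is_speech):
            if speech and start is None:
                start = i                                        # start point of a speech region
            elif not speech and start is not None:
                endpoints.append((start * frame_len_s, i * frame_len_s))
                start = None
        if start is not None:                                    # speech runs to the final frame
            endpoints.append((start * frame_len_s, len(is_speech) * frame_len_s))
        return endpoints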
In summary, the voice endpoint detection method extracts speech features from the speech to be detected to obtain a plurality of feature frames; calculates a speech likelihood value of each feature frame against a pre-trained speech model and a non-speech likelihood value against a pre-trained non-speech model; calculates a signal-to-noise likelihood ratio of each feature frame from these likelihood values; judges, based on the ratio, whether the frame corresponding to each feature frame is a speech frame or a non-speech frame; adaptively updates the speech model on all speech frames and the non-speech model on all non-speech frames; recomputes final speech and non-speech likelihood values of each feature frame with the updated models; and computes a final signal-to-noise likelihood ratio to detect the endpoints of the speech to be detected. Through adaptive training of the two classification models, speech and non-speech, the voice endpoint detection method and device can adapt to many noise scenes and improve the accuracy of voice endpoint detection.
In an optional embodiment of the present application, for the adaptive updating performed in step S105, adaptively updating the speech model based on all the speech frames and adaptively updating the non-speech model based on all the non-speech frames includes:
stopping the adaptive updating of the speech model and the non-speech model when a condition is met;
the condition includes at least one of the following:
the signal-to-noise likelihood ratios of the feature frames have stabilized;
the update error of the speech model or the non-speech model is smaller than a preset threshold.
In an optional embodiment of the present application, in step S102, the speech likelihood value of each feature frame against the pre-trained speech model and the non-speech likelihood value of each feature frame against the pre-trained non-speech model are calculated by any one of the following algorithms:
a maximum a posteriori probability algorithm, a maximum likelihood estimation algorithm, or an expectation-maximization algorithm.
In an optional embodiment of the present application, the method further comprises:
acquiring training data, wherein the training data comprises non-voice data and voice data;
training a first preset Gaussian mixture model by using the non-voice data to obtain the non-voice model;
training a second preset Gaussian mixture model by using the voice data to obtain the voice model;
adaptively updating the speech model based on all the speech frames and adaptively updating the non-speech model based on all the non-speech frames, wherein an algorithm of the adaptive updating comprises any one of the following:
maximum a posteriori probability MAP, maximum likelihood linear regression MLLR.
When the speech model and the non-speech model are Gaussian mixture models, the adaptive updating algorithm can be the maximum a posteriori (MAP) algorithm, maximum likelihood linear regression (MLLR), or the like; a sketch of MAP mean adaptation for such a model is given below.
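The application names MAP and MLLR without giving formulas; as one hedged illustration, the standard MAP update of the GMM means interpolates, per component, between the prior mean and the data mean with a relevance factor tau (an assumed value):

    # Sketch of MAP mean adaptation for a Gaussian mixture model.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def map_adapt_means(gmm: GaussianMixture, frames: np.ndarray, tau: float = 10.0) -> GaussianMixture:
        """mu_k_new = (n_k * xbar_k + tau * mu_k) / (n_k + tau), where n_k is the soft count
        of frames assigned to component k and xbar_k is their responsibility-weighted mean."""
        resp = gmm.predict_proba(frames)                  # responsibilities, shape (num_frames, K)
        n_k = resp.sum(axis=0)                            # soft counts per component
        xbar = (resp.T @ frames) / np.maximum(n_k[:, None], 1e-10)
        alpha = (n_k / (n_k + tau))[:, None]              # weight given to the adaptation data
        gmm.means_ = alpha * xbar + (1.0 - alpha) * gmm.means_
        return gmm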
In an alternative embodiment of the present application, as shown in fig. 3, performing speech feature extraction on the speech to be detected in step S101 includes the following steps.
In one embodiment of the present application, MFCC features of the speech to be detected are extracted, and the feature extraction steps are as follows.
and S301, pre-emphasizing the voice to be detected, and lifting a high-frequency part of the voice to be detected to obtain the pre-emphasized voice to be detected. The high-frequency part is improved by pre-emphasis of the voice to be detected, so that the frequency spectrum of the signal becomes flat, the signal is kept in the whole frequency band from low frequency to high frequency, and the frequency spectrum can be obtained by the same signal-to-noise ratio.
Step S302: the pre-emphasized speech to be detected is divided into a plurality of initial speech frames. The speech to be detected is divided at a fixed length into multiple frames; here "speech" does not refer specifically to the human voice but to any sound recorded in the audio to be detected, i.e. a speech frame is a frame of that sound. The speech signal is a non-stationary, time-varying signal, but it can be regarded as stationary and time-invariant over a short time window. Therefore, in speech signal processing the signal is processed in segments, each called a frame, to reduce the influence of the non-stationarity and time variation of the whole signal. Further, the length of a frame (feature frame, speech frame, or non-speech frame) is 20 ms: the speech to be detected is divided into frames on a 20 ms scale, and the frame length of the subsequent feature frames, speech frames, and non-speech frames remains the same as that of the frames divided here.
Further, among the plurality of initial speech frames there is no overlap between any two adjacent initial speech frames. Because speech and noise are usually uncorrelated between the previous frame and the next, no overlap is required, which reduces the amount of computation. The time difference between the start positions of two adjacent frames is called the frame shift; in practice a frame shift of 10 ms is commonly used.
Observation is carried out frame by frame, and the number of sampling points in a frame depends on the frame length and the sampling frequency. In one embodiment of the present application, for audio sampled at 16000 Hz with a frame length of 20 ms, a frame contains 16000 × 0.020 = 320 points, so a frame can also be described as an observation unit formed by collecting N sampling points; the frame shift is 16000 × 0.010 = 160 points.
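A minimal framing sketch using these figures (for the non-overlapping framing described above, the shift would simply equal the frame length):

    # Sketch: split a waveform into 320-sample frames with a 160-sample shift at 16 kHz.
    import numpy as np

    def frame_signal(x: np.ndarray, sr: int = 16000, frame_ms: float = 20.0, shift_ms: float = 10.0):
        frame_len = int(sr * frame_ms / 1000)             # 16000 * 0.020 = 320 sampling points
        frame_shift = int(sr * shift_ms / 1000)           # 16000 * 0.010 = 160 sampling points
        num_frames = max(0, 1 + (len(x) - frame_len) // frame_shift)
        if num_frames == 0:
            return np.empty((0, frame_len))
        return np.stack([x[i * frame_shift: i * frame_shift + frame_len] for i in range(num_frames)])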
Step S303: each initial speech frame is multiplied by a Hamming window to obtain an intermediate-processed signal. Multiplying each frame by a Hamming window increases the continuity at the left and right ends of the frame.
Step S304: a fast Fourier transform is applied to the intermediate-processed signal to obtain the frequency spectrum and energy distribution of the speech frame.
That is, a fast Fourier transform is applied to each framed and windowed frame signal to obtain the frequency spectrum and energy distribution of each frame.
Step S305: the energy spectrum is filtered by a bank of Mel-scale triangular filters to smooth the frequency spectrum of the speech frame. Passing the energy spectrum through the Mel-scale triangular filter bank smooths the spectrum, suppresses the effect of harmonics, and highlights the formants of the original speech.
Step S306: the logarithmic energy output by each triangular filter is calculated, Mel-frequency cepstral coefficients (MFCC) are obtained by applying a discrete cosine transform (DCT) to the logarithmic energies, and the MFCC coefficients are taken as the speech features. A sketch of this pipeline (steps S301 to S306) is given below.
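Putting steps S301-S306 together, the MFCC extraction of this embodiment could be sketched as follows; the pre-emphasis coefficient, FFT size, and filter-bank size are assumed values, and librosa is used only to build the Mel filter bank:

    # Sketch of the S301-S306 MFCC pipeline; parameter values are assumptions.
    import numpy as np
    import librosa
    from scipy.fft import dct

    def mfcc_features(x, sr=16000, frame_len=320, frame_shift=320, n_fft=512, n_mels=26, n_mfcc=13):
        # S301: pre-emphasis boosts the high-frequency part (0.97 is a common, assumed coefficient)
        x = np.append(x[0], x[1:] - 0.97 * x[:-1])
        # S302: divide into initial frames (frame_shift == frame_len gives non-overlapping frames)
        num_frames = 1 + (len(x) - frame_len) // frame_shift
        frames = np.stack([x[i * frame_shift: i * frame_shift + frame_len] for i in range(num_frames)])
        # S303: multiply each frame by a Hamming window
        frames = frames * np.hamming(frame_len)
        # S304: fast Fourier transform -> power (energy) spectrum of each frame
        power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
        # S305: Mel-scale triangular filter bank smooths the spectrum
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (n_mels, n_fft // 2 + 1)
        mel_energy = power @ mel_fb.T
        # S306: log energy of each filter output, then DCT -> MFCC coefficients
        log_energy = np.log(np.maximum(mel_energy, 1e-10))
        return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]  # (num_frames, n_mfcc)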
Based on the same inventive concept, an embodiment of the present application provides a voice endpoint detection apparatus. Referring to fig. 4, fig. 4 is a schematic diagram of a voice endpoint detection apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes:
the feature extraction module 401 is configured to perform speech feature extraction on the speech to be detected to obtain a plurality of feature frames;
a speech likelihood value calculation module 402, configured to calculate a speech likelihood value of each feature frame against a pre-trained speech model;
a non-speech likelihood value calculation module 403, configured to calculate a non-speech likelihood value of each feature frame against a pre-trained non-speech model;
a signal-to-noise likelihood ratio calculation module 404, configured to calculate a signal-to-noise likelihood ratio of each of the feature frames based on the speech likelihood values and the non-speech likelihood values;
a speech frame determining module 405, configured to determine, based on the signal-to-noise likelihood ratio, that a frame, corresponding to each feature frame, in the speech to be detected is a speech frame or a non-speech frame;
an adaptive update module 406, configured to perform adaptive update on the speech model based on all the speech frames, and perform adaptive update on the non-speech model based on all the non-speech frames;
an endpoint detection module 407, configured to calculate a final signal-to-noise likelihood ratio of each feature frame based on the final speech likelihood value and the final non-speech likelihood value, and to detect the endpoints of the speech to be detected based on the final signal-to-noise likelihood ratio;
wherein the final speech likelihood value and the final non-speech likelihood value of each feature frame are respectively calculated with the updated speech model and the updated non-speech model.
In an optional embodiment of the present application, the adaptive update module includes:
the condition matching submodule is used for stopping the adaptive updating of the speech model and the non-speech model when a condition is met;
the condition includes at least one of the following:
the signal-to-noise likelihood ratios of the feature frames have stabilized;
the update error of the speech model or the non-speech model is smaller than a preset threshold.
In an optional embodiment of the present application, in the speech likelihood value calculation module and the non-speech likelihood value calculation module, the speech likelihood value of each feature frame against the pre-trained speech model and the non-speech likelihood value of each feature frame against the pre-trained non-speech model are calculated by any one of the following algorithms:
a maximum a posteriori probability algorithm, a maximum likelihood estimation algorithm, or an expectation-maximization algorithm.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method and the device for detecting the voice endpoint provided by the application are introduced in detail, a specific example is applied in the text to explain the principle and the implementation of the application, and the description of the above embodiment is only used for helping to understand the method and the core idea of the application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method for voice endpoint detection, the method comprising:
performing voice feature extraction on voice to be detected to obtain a plurality of feature frames;
calculating a speech likelihood value of each feature frame against a pre-trained speech model, and calculating a non-speech likelihood value of each feature frame against a pre-trained non-speech model;
calculating a signal-to-noise likelihood ratio of each feature frame based on the speech likelihood value and the non-speech likelihood value;
judging, based on the signal-to-noise likelihood ratio, whether the frame of the speech to be detected corresponding to each feature frame is a speech frame or a non-speech frame;
adaptively updating the speech model based on all the speech frames and adaptively updating the non-speech model based on all the non-speech frames;
respectively calculating a final speech likelihood value and a final non-speech likelihood value of each feature frame by using the updated speech model and the updated non-speech model;
and calculating a final signal-to-noise likelihood ratio of each feature frame based on the final speech likelihood value and the final non-speech likelihood value, and detecting the endpoints of the speech to be detected based on the final signal-to-noise likelihood ratio.
2. The method of claim 1, wherein adaptively updating the speech model based on all of the speech frames and adaptively updating the non-speech model based on all of the non-speech frames comprises:
stopping the adaptive updating of the speech model and the non-speech model when a condition is met;
the condition comprising at least one of the following:
the signal-to-noise likelihood ratios of the feature frames have stabilized;
the update error of the speech model or the non-speech model is smaller than a preset threshold.
3. The method of claim 1, wherein the speech likelihood value of each feature frame against the pre-trained speech model and the non-speech likelihood value of each feature frame against the pre-trained non-speech model are calculated by any one of the following algorithms:
a maximum a posteriori probability algorithm, a maximum likelihood estimation algorithm, or an expectation-maximization algorithm.
4. The method of claim 1, further comprising:
acquiring training data, wherein the training data comprises non-voice data and voice data;
training a first preset Gaussian mixture model by using the non-voice data to obtain the non-voice model;
training a second preset Gaussian mixture model by using the voice data to obtain the voice model;
adaptively updating the speech model based on all the speech frames and adaptively updating the non-speech model based on all the non-speech frames, wherein an algorithm of the adaptive updating comprises any one of the following:
maximum a posteriori probability MAP, maximum likelihood linear regression MLLR.
5. The method of claim 1, wherein performing speech feature extraction on the speech to be detected comprises:
performing pre-emphasis on the speech to be detected to boost its high-frequency part, and obtaining the pre-emphasized speech to be detected;
dividing the pre-emphasized voice to be detected into a plurality of initial voice frames;
multiplying each initial voice frame by a Hamming window to obtain an intermediate processed signal;
performing fast Fourier transform on the intermediate processed signal to obtain the frequency spectrum and energy distribution of the voice frame;
filtering the energy spectrum through a bank of Mel-scale triangular filters to smooth the frequency spectrum of the speech frame;
and calculating the logarithmic energy output by each triangular filter, obtaining Mel-frequency cepstral coefficients (MFCC) by applying a discrete cosine transform (DCT) to the logarithmic energy, and taking the MFCC coefficients as the speech features.
6. The method of claim 1, wherein the length of the feature frame, the speech frame, or a frame in the non-speech frame is 20 ms.
7. The method of claim 5, wherein there is no overlap between any two adjacent initial speech frames in the plurality of initial speech frames.
8. An apparatus for voice endpoint detection, the apparatus comprising:
the feature extraction module is used for extracting voice features of the voice to be detected to obtain a plurality of feature frames;
the speech likelihood value calculation module is used for calculating a speech likelihood value of each feature frame against a pre-trained speech model;
the non-speech likelihood value calculation module is used for calculating a non-speech likelihood value of each feature frame against a pre-trained non-speech model;
a signal-to-noise likelihood ratio calculation module for calculating a signal-to-noise likelihood ratio of each feature frame based on the speech likelihood value and the non-speech likelihood value;
the voice frame judging module is used for judging whether a frame corresponding to each characteristic frame in the voice to be detected is a voice frame or a non-voice frame based on the signal-to-noise likelihood ratio;
the self-adaptive updating module is used for carrying out self-adaptive updating on the voice model based on all the voice frames and carrying out self-adaptive updating on the non-voice model based on all the non-voice frames;
an endpoint detection module, configured to calculate a final signal-to-noise likelihood ratio of each feature frame based on the final speech likelihood value and the final non-speech likelihood value, and to detect the endpoints of the speech to be detected based on the final signal-to-noise likelihood ratio;
wherein the final speech likelihood value and the final non-speech likelihood value of each feature frame are respectively calculated with the updated speech model and the updated non-speech model.
9. The apparatus of claim 8, wherein the adaptive update module comprises:
the condition matching submodule is used for stopping the adaptive updating of the speech model and the non-speech model when a condition is met;
the condition comprising at least one of the following:
the signal-to-noise likelihood ratios of the feature frames have stabilized;
the update error of the speech model or the non-speech model is smaller than a preset threshold.
10. The apparatus according to claim 8, wherein, in the speech likelihood value calculation module and the non-speech likelihood value calculation module, the speech likelihood value of each feature frame against the pre-trained speech model and the non-speech likelihood value of each feature frame against the pre-trained non-speech model are calculated by any one of the following algorithms:
a maximum a posteriori probability algorithm, a maximum likelihood estimation algorithm, or an expectation-maximization algorithm.
CN202011213344.2A 2020-11-03 2020-11-03 Voice endpoint detection method and device Pending CN112489692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213344.2A CN112489692A (en) 2020-11-03 2020-11-03 Voice endpoint detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011213344.2A CN112489692A (en) 2020-11-03 2020-11-03 Voice endpoint detection method and device

Publications (1)

Publication Number Publication Date
CN112489692A true CN112489692A (en) 2021-03-12

Family

ID=74928054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213344.2A Pending CN112489692A (en) 2020-11-03 2020-11-03 Voice endpoint detection method and device

Country Status (1)

Country Link
CN (1) CN112489692A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020165713A1 (en) * 2000-12-04 2002-11-07 Global Ip Sound Ab Detection of sound activity
US20040030544A1 (en) * 2002-08-09 2004-02-12 Motorola, Inc. Distributed speech recognition with back-end voice activity detection apparatus and method
US20040064314A1 (en) * 2002-09-27 2004-04-01 Aubert Nicolas De Saint Methods and apparatus for speech end-point detection
US20050131689A1 (en) * 2003-12-16 2005-06-16 Cannon Kakbushiki Kaisha Apparatus and method for detecting signal
US20060253283A1 (en) * 2005-05-09 2006-11-09 Kabushiki Kaisha Toshiba Voice activity detection apparatus and method
US20070271093A1 (en) * 2006-05-22 2007-11-22 National Cheng Kung University Audio signal segmentation algorithm
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
US20120209604A1 (en) * 2009-10-19 2012-08-16 Martin Sehlstedt Method And Background Estimator For Voice Activity Detection
US20120221330A1 (en) * 2011-02-25 2012-08-30 Microsoft Corporation Leveraging speech recognizer feedback for voice activity detection
US20140249812A1 (en) * 2013-03-04 2014-09-04 Conexant Systems, Inc. Robust speech boundary detection system and method
KR20180067920A (en) * 2016-12-13 2018-06-21 한국전자통신연구원 System and method for end-point detection of speech based in harmonic component
CN109545188A (en) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 A kind of real-time voice end-point detecting method and device
CN110706694A (en) * 2019-09-26 2020-01-17 成都数之联科技有限公司 Voice endpoint detection method and system based on deep learning
CN111862951A (en) * 2020-07-23 2020-10-30 海尔优家智能科技(北京)有限公司 Voice endpoint detection method and device, storage medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
古丽拉・阿东别克, 于迎霞: "Endpoint detection of noisy speech based on LPC Mel-cepstrum features" (基于LPC美尔倒谱特征的带噪语音端点检测), Audio Engineering (电声技术), no. 02 *
李晔; 崔慧娟; 唐昆: "Speech endpoint detection algorithm based on energy and discrimination information" (基于能量和鉴别信息的语音端点检测算法), Journal of Tsinghua University (Science and Technology) (清华大学学报(自然科学版)), no. 07 *
李燕诚; 崔慧娟; 唐昆: "Voice activity detection algorithm based on likelihood ratio test" (基于似然比测试的语音激活检测算法), Computer Engineering (计算机工程), no. 10 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115472152A (en) * 2022-11-01 2022-12-13 北京探境科技有限公司 Voice endpoint detection method and device, computer equipment and readable storage medium
CN115472152B (en) * 2022-11-01 2023-03-03 北京探境科技有限公司 Voice endpoint detection method and device, computer equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN106935248B (en) Voice similarity detection method and device
EP1083541B1 (en) A method and apparatus for speech detection
EP2363852B1 (en) Computer-based method and system of assessing intelligibility of speech represented by a speech signal
JPH08508107A (en) Method and apparatus for speaker recognition
JP2006079079A (en) Distributed speech recognition system and its method
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN111540342B (en) Energy threshold adjusting method, device, equipment and medium
Archana et al. Gender identification and performance analysis of speech signals
CN108682432B (en) Speech emotion recognition device
KR100682909B1 (en) Method and apparatus for recognizing speech
Lee et al. Dynamic noise embedding: Noise aware training and adaptation for speech enhancement
CN112116909A (en) Voice recognition method, device and system
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN112489692A (en) Voice endpoint detection method and device
CN114303186A (en) System and method for adapting human speaker embedding in speech synthesis
JPS60200300A (en) Voice head/end detector
CN107507610B (en) Chinese tone recognition method based on vowel fundamental frequency information
JPH06110488A (en) Method and device for speech detection
Dumpala et al. Robust Vowel Landmark Detection Using Epoch-Based Features.
CN112509556B (en) Voice awakening method and device
Kasap et al. A unified approach to speech enhancement and voice activity detection
JPS60114900A (en) Voice/voiceless discrimination
CN110033786B (en) Gender judgment method, device, equipment and readable storage medium
US11270721B2 (en) Systems and methods of pre-processing of speech signals for improved speech recognition
Sudhakar et al. Automatic speech segmentation to improve speech synthesis performance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination