CN112652296A - Streaming voice endpoint detection method, device and equipment - Google Patents

Streaming voice endpoint detection method, device and equipment

Info

Publication number
CN112652296A
Authority
CN
China
Prior art keywords
voice
detected
point
state
streaming
Prior art date
Legal status
Granted
Application number
CN202011543429.7A
Other languages
Chinese (zh)
Other versions
CN112652296B (en)
Inventor
李锴
丛继晔
沈来信
Current Assignee
Beijing Thunisoft Information Technology Co ltd
Original Assignee
Beijing Thunisoft Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Thunisoft Information Technology Co ltd
Priority to CN202011543429.7A
Publication of CN112652296A
Application granted
Publication of CN112652296B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a streaming voice endpoint detection method, apparatus, and device. The method comprises the following steps: judging, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition; and when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint.

Description

Streaming voice endpoint detection method, device and equipment
Technical Field
The present application relates to the field of speech recognition technology, and in particular to a streaming voice endpoint detection method, apparatus, and device.
Background
In the prior art, voice endpoint detection is a technique for judging whether real speech is present in given audio data, based on thresholds and statistical probability models. It is commonly used in speech recognition, speech coding/decoding, noise reduction, and gain-control algorithms to separate speech from non-speech data. In recent years, deep learning techniques have also been applied to voice endpoint detection algorithms.
In the course of implementing the prior art, the inventors found that:
Conventional voice endpoint detection methods, whether based on traditional probability models or on deep learning models, have reached a mature stage in terms of accuracy. However, in the presence of transient vocal noise or background noise, these methods may still misjudge endpoints and produce jumps; and when the start or tail point of an utterance is pronounced unclearly, the model may misjudge and lose words.
Therefore, the invention provides a streaming voice endpoint detection method, apparatus, and device that can reduce jumps caused by noise and word loss caused by unclear enunciation.
Disclosure of Invention
The embodiments of the present application provide a streaming voice endpoint detection method, apparatus, and device capable of reducing jumps caused by noise and word loss caused by unclear enunciation, aiming to solve the problems that voice endpoint detection methods misjudge endpoints and produce jumps, and that models misjudge and lose words when the start and tail points of an utterance are pronounced unclearly.
A streaming voice endpoint detection method, comprising:
judging, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition;
and when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint.
Further, when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint specifically comprises:
when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to a preset parameter, the point to be detected that is still in the silent state is a voice endpoint;
when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the detection point at which the count of consecutive silent detection points equals the preset parameter is a voice endpoint.
Further, when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to the preset parameter, the point to be detected that is still in the silent state is the starting endpoint of the voice segment.
Further, when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the detection point at which the count of consecutive silent detection points equals the preset parameter is the ending endpoint of the voice segment.
Further, the preset parameter is set according to the length of each frame in the streaming voice, specifically:
the length of a voice frame is set to 20 ms to 50 ms, and the preset parameter is set to 5 to 7 according to the frame length.
Further, the number of segment frames of the streaming voice is equal to the number of frames specified by the voice endpoint detection model.
Further, the voice endpoint detection model is trained using one of a feedforward sequential memory network (FSMN) and a ConvLSTM network, the latter combining a convolutional neural network (CNN) with a recurrent LSTM network.
Further, when the voice endpoint detection model is trained with the ConvLSTM network combining a CNN and an LSTM, the network consists, in order, of: an input layer, a one-dimensional convolution layer, an LSTM layer, a fully connected layer, a batch normalization layer, an activation layer, another fully connected layer, and an output layer.
A streaming voice endpoint detection apparatus, comprising:
an input module, configured to judge, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition;
and an output module, configured to confirm that the point to be detected is a voice endpoint when the voice state of the point to be detected of the streaming voice satisfies the preset condition.
A streaming voice endpoint detection device, the voice endpoint detection device comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the streaming voice endpoint detection device to perform the voice endpoint detection method of claims 1-8.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a streaming voice endpoint detection method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a network structure of a streaming voice endpoint detection model according to an embodiment of the present application;
fig. 3 is a diagram illustrating judgment results of streaming voice endpoint detection according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a streaming voice endpoint detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, the streaming voice endpoint detection method provided in the present application includes:
s100: and judging whether the voice state of the point to be detected of the streaming voice meets a preset condition or not by using a voice endpoint detection model.
The streaming voice is either a randomly acquired segment of voice, or a segment produced through several steps and specifically prepared for endpoint detection.
Further, in a preferred embodiment provided by the present application, the streaming voice may be a randomly acquired segment of voice; or it may be produced through several steps and specifically prepared for endpoint detection. The latter is produced as follows: several spaced blank segments are randomly inserted into existing clean voice data to serve as silence segments; after the silence segments and the non-silent clean voice data are assembled into audio, noise data is randomly mixed into the audio to obtain the streaming voice. The length of each silence segment is set according to the duration of the clean voice, here randomly between 1 s and 5 s; when each silence segment is inserted into the clean voice data, a corresponding label is generated for subsequent use; the frame length of the silence segments is consistent with that of the original clean voice; and the noise intensity of the noise data may be set randomly, here between -5 dB and -25 dB. It is understood that the specific settings of the clean voice data, the silence segments, and the noise intensity do not limit the protection scope of the present application.
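As a minimal sketch of this data-construction step, the silence insertion and noise mixing could look as follows (the function name, the number of chunks, and the use of numpy are illustrative assumptions; only the 1 s to 5 s silence lengths and the -5 dB to -25 dB noise levels come from the description above):

import numpy as np

def make_streaming_voice(clean, noise, sr=16000, rng=None):
    # Insert random silence segments into clean speech, then mix in noise.
    # Returns the mixed audio and per-sample labels (1 = speech, 0 = silence).
    rng = rng or np.random.default_rng()
    pieces, labels = [], []
    # Split the clean audio into a few chunks and insert silence between them.
    for chunk in np.array_split(clean, rng.integers(2, 5)):
        pieces.append(chunk)
        labels.append(np.ones(len(chunk), dtype=np.int8))
        sil_len = int(rng.uniform(1.0, 5.0) * sr)    # silence length: 1 s to 5 s
        pieces.append(np.zeros(sil_len, dtype=clean.dtype))
        labels.append(np.zeros(sil_len, dtype=np.int8))
    audio = np.concatenate(pieces)
    label = np.concatenate(labels)
    # Mix noise at a random level 5 dB to 25 dB below the speech power,
    # i.e. the -5 dB to -25 dB noise intensity described above.
    snr_db = rng.uniform(5.0, 25.0)
    noise = np.resize(noise, audio.shape)
    scale = np.sqrt(np.mean(audio ** 2) /
                    (np.mean(noise ** 2) * 10 ** (snr_db / 10) + 1e-12))
    return audio + scale * noise, label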
The number of segment frames of the streaming voice is equal to the number of frames specified by the voice endpoint detection model.
Further, in a preferred embodiment provided by the present application, the number of segment frames of the streaming voice should be set so that the audio can be processed by the voice endpoint detection model; therefore, the number of segment frames of the streaming voice should be equal to the number of frames specified by the voice endpoint detection model.
The voice endpoint detection model is obtained by repeatedly training on streaming voice that is produced through the above steps and specifically prepared for endpoint detection. The streaming voice is divided into a training set and a validation set according to a given ratio, and mel-frequency cepstral coefficient (MFCC) features are extracted from both sets; the extracted MFCC features are then used to train either a feedforward sequential memory network (FSMN) or a ConvLSTM network combining a convolutional neural network (CNN) and a recurrent LSTM network.
Further, in a preferred embodiment provided herein, the streaming voice is divided into a training set and a validation set at a ratio of 7:3, and MFCC features are then extracted from both. The MFCC extraction process is: pre-emphasis, framing and windowing, fast Fourier transform, triangular band-pass filters, calculation of the logarithmic energy output by each filter bank, and extraction of dynamic difference parameters. The MFCC input has dimension 36 x 10: the MFCC features are 36-dimensional, and 10 is the context length in frames. With each frame set to 30 ms, combining the frames means that 300 ms of audio is used each time to predict whether the current frame is non-silent. It is to be understood that the specific numerical values used here for processing the voice data do not limit the protection scope of the present application.
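A minimal feature-extraction sketch under these assumptions (36-dimensional MFCCs, non-overlapping 30 ms frames, a 10-frame context window; the use of librosa is an assumption made here for convenience and is not named in the patent):

import numpy as np
import librosa

def mfcc_context_features(audio, sr=16000, n_mfcc=36, context=10):
    # Extract 36-dim MFCCs on 30 ms frames and stack a 10-frame context,
    # giving the 36 x 10 input (300 ms of audio per prediction) described above.
    frame_len = int(0.030 * sr)                      # 30 ms frames
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=frame_len)
    # Slide a 10-frame context window along the frame axis.
    windows = [mfcc[:, i:i + context]
               for i in range(mfcc.shape[1] - context + 1)]
    return np.stack(windows)                         # (num_windows, 36, 10)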
Referring to fig. 2, the neural network used in the present application is a ConvLSTM network combining a convolutional neural network (CNN) and a recurrent LSTM network, trained on the extracted MFCC features. The network consists, in order, of: an input layer, a one-dimensional convolution layer, an LSTM layer, a fully connected layer, a batch normalization layer, an activation layer, another fully connected layer, and an output layer. In the embodiment of the present application, the ConvLSTM network is trained with the focal loss, whose specific formula is:
FL(p_t) = -(1 - p_t)^γ · log(p_t),
where γ is greater than or equal to zero and is set to 2 here, and (1 - p_t)^γ is a balance factor of the loss function that tilts the training toward minority-class and hard examples.
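A minimal sketch of this network and loss, assuming a TensorFlow/Keras implementation (the framework choice and the layer widths are assumptions; only the layer order and γ = 2 come from the description above):

import tensorflow as tf

def focal_loss(gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t), for binary speech/silence labels.
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
        return -tf.reduce_mean((1.0 - p_t) ** gamma * tf.math.log(p_t))
    return loss

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 36)),           # 10 context frames x 36 MFCCs
    tf.keras.layers.Conv1D(64, 3, padding="same"),   # one-dimensional convolution layer
    tf.keras.layers.LSTM(64),                        # LSTM layer
    tf.keras.layers.Dense(32),                       # fully connected layer
    tf.keras.layers.BatchNormalization(),            # batch normalization layer
    tf.keras.layers.Activation("relu"),              # activation layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # fully connected + output layer
])
model.compile(optimizer="adam", loss=focal_loss(gamma=2.0))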
Training in this way yields the voice endpoint detection model. It should be understood that the specific numerical values used here for processing the voice data do not limit the protection scope of the present application.
The voice endpoint detection model then judges whether the voice state of each point to be detected of the streaming voice satisfies the preset condition.
Further, in a preferred embodiment provided in the present application, judging the voice state of a point to be detected of the streaming voice by using the voice endpoint detection model specifically includes:
Referring to fig. 3, in the embodiment of the present application, the non-silence and silence results obtained by judging the streaming voice are represented by Arabic numerals:
if the voice endpoint detection model detects that the voice changes from silent to non-silent, or is non-silent, the output is 1;
if the voice endpoint detection model detects that the voice changes from non-silent to silent, or is silent, the output is 0.
The voice state is judged by the voice endpoint detection model, yielding a sequence of 0s and 1s as shown in fig. 3. From this sequence it can be seen that some points to be detected of the streaming voice satisfy the preset condition while others do not. The cases in which the preset condition is not satisfied fall into two categories. First, when the point to be detected does not transition from silent to non-silent or from non-silent to silent, the silent or non-silent state simply continues and no judgment is needed. Second, when the voice state of the point to be detected transitions from silent to non-silent but the number of consecutive silent and non-silent detection points on the two sides of the point is smaller than the preset parameter, or when it transitions from non-silent to silent but the number of consecutive non-silent and silent detection points on the two sides of the point is smaller than the preset parameter, the point to be detected is determined not to be a voice endpoint, but rather a jump produced by misjudgment of the voice endpoint detection method, or a misjudgment of the model, caused by unclear pronunciation, that would lose words. When the preset condition is satisfied, the voice endpoint can be confirmed. It is to be understood that the representation used for the output here does not limit the protection scope of the present application.
S200: and when the voice state of the point to be detected of the streaming voice meets the preset condition, confirming that the point to be detected is a voice endpoint. When the voice endpoint meets the preset condition, namely when the voice state of the point to be detected is changed from mute to non-mute, and the number of the detection points with continuous non-mute states of the point to be detected is greater than or equal to the preset parameter, the mute state of the point to be detected is the voice endpoint; when the voice state of the point to be detected is changed from non-mute state to mute state, and the number of the detection points with the mute state continuous of the point to be detected is greater than or equal to the preset parameter, the detection points with the number of the detection points with the mute state continuous of the point to be detected equal to the preset parameter are the voice end points. When the voice endpoint is greater than or equal to the preset parameter, the maxbuff function is also used to determine the specific frame of the voice endpoint, and in this embodiment, the preset parameter is the same as the maxbuff value.
Further, in a preferred embodiment provided by the present application, when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to the preset parameter, the first frame in maxbuff, i.e., the frame at which the point to be detected was still silent, is the starting endpoint of the voice segment.
Further, in a preferred embodiment provided by the present application, when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the last frame in maxbuff, i.e., the frame at which the count of consecutive silent detection points equals the preset parameter, is the ending endpoint of the voice segment.
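A minimal sketch of this buffered decision logic, assuming per-frame 0/1 labels as output by the model (the class and method names are illustrative; the patent only names the maxbuff buffer and the preset parameter):

from collections import deque

class EndpointDetector:
    # Streaming endpoint detector over per-frame 0/1 (silence/speech) labels.
    def __init__(self, preset_param=5):
        self.k = preset_param                    # preset parameter == maxbuff size
        self.maxbuff = deque(maxlen=preset_param)
        self.in_speech = False

    def push(self, frame_idx, label):
        # Feed one frame label; return ('start'|'end', frame index) or None.
        self.maxbuff.append((frame_idx, label))
        if len(self.maxbuff) < self.k:
            return None
        labels = [l for _, l in self.maxbuff]
        if not self.in_speech and all(l == 1 for l in labels):
            self.in_speech = True
            # Starting endpoint: taken here as the first frame in maxbuff.
            return ("start", self.maxbuff[0][0])
        if self.in_speech and all(l == 0 for l in labels):
            self.in_speech = False
            # Ending endpoint: the frame where the consecutive-silence count reaches k.
            return ("end", self.maxbuff[-1][0])
        return None

Because an isolated 0 or 1 never fills the buffer with a uniform run, short jumps caused by noise are suppressed rather than reported as endpoints.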
The setting of the preset parameter is decided according to the length of each frame in the streaming voice.
Further, in a preferred embodiment provided by the present application, the preset parameter is set according to the length of each frame in the streaming voice: the frame length is set to 20 ms to 50 ms, and the preset parameter is set to 5 to 7 accordingly. In this embodiment, the frame length of the streaming voice is set to 30 ms and the preset parameter is set to 5; that is, any 5 consecutive points to be detected in the streaming voice form one unit, and when the voice endpoint detection model outputs 5 or more consecutive 0s or 1s, part of the preset condition is satisfied. It will be understood that the specific values of the preset parameters do not limit the protection scope of the present application.
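Feeding the sketch above a label sequence like the one in fig. 3 illustrates, under these assumed values (preset parameter 5), both endpoint reporting and jump suppression:

# Hypothetical label stream: silence, speech, a brief noise-induced jump, silence.
labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

det = EndpointDetector(preset_param=5)
for i, label in enumerate(labels):
    event = det.push(i, label)
    if event:
        print(event)   # prints ('start', 3), then ('end', 16);
                       # the isolated 0/1 jitter at frames 10-11 triggers nothing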
Referring to fig. 4, a streaming voice endpoint detection apparatus 100 includes:
the input module 10 determines whether the voice state of the point to be detected of the streaming voice satisfies a preset condition by using the voice endpoint detection model.
Further, in an embodiment provided in the present application, the input module 10 is configured to determine whether a voice state of a point to be detected of streaming voice satisfies a preset condition by using a voice endpoint detection model.
And the output module 11 is used for confirming that the point to be detected is a voice endpoint when the voice state of the point to be detected of the streaming voice meets the preset condition.
Further, in an embodiment provided by the present application, the output module 11 is configured to confirm that a point to be detected of the streaming voice is a voice endpoint when a voice state of the point meets a preset condition.
The voice endpoint detection apparatus here can be understood, in one specific application, as a virtual apparatus, e.g., a software product similar to a browser. The input module 10 and the output module 11 can be understood as functions that can be packaged independently.
A streaming voice endpoint detection device, the voice endpoint detection device comprising: a memory having instructions stored therein, and at least one processor.
The memory stores a voice endpoint detection program which, when executed by the processor, implements the steps of the voice endpoint detection method described in any of the above embodiments. For the method implemented when the processor executes the voice endpoint detection program, reference may be made to the embodiments of the voice endpoint detection method of the present invention, and the description is not repeated here.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A streaming voice endpoint detection method, comprising:
judging, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition;
and when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint.
2. The method according to claim 1, wherein, when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint specifically comprises:
when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to a preset parameter, the point to be detected that is still in the silent state is a voice endpoint;
when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the detection point at which the count of consecutive silent detection points equals the preset parameter is a voice endpoint.
3. The method according to claim 2, wherein, when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to the preset parameter, the point to be detected that is still in the silent state is the starting endpoint of the voice segment.
4. The method according to claim 2, wherein, when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the detection point at which the count of consecutive silent detection points equals the preset parameter is the ending endpoint of the voice segment.
5. The method according to claim 2, wherein the preset parameter is set according to the length of each frame in the streaming voice, specifically:
the length of a voice frame is set to 20 ms to 50 ms, and the preset parameter is set to 5 to 7 according to the frame length.
6. The streaming voice endpoint detection method of claim 1, wherein the number of segment frames of the streaming voice is equal to the number of frames specified by the voice endpoint detection model.
7. The streaming voice endpoint detection method of claim 1, wherein the voice endpoint detection model is trained using one of a feedforward sequential memory network (FSMN) and a ConvLSTM network, the latter combining a convolutional neural network (CNN) with a recurrent LSTM network.
8. The streaming voice endpoint detection method of claim 7, wherein, when the voice endpoint detection model is trained with the ConvLSTM network combining a CNN and an LSTM, the network consists, in order, of: an input layer, a one-dimensional convolution layer, an LSTM layer, a fully connected layer, a batch normalization layer, an activation layer, another fully connected layer, and an output layer.
9. A streaming voice endpoint detection apparatus, comprising:
an input module, configured to judge, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition;
and an output module, configured to confirm that the point to be detected is a voice endpoint when the voice state of the point to be detected of the streaming voice satisfies the preset condition.
10. A streaming voice endpoint detection device, the voice endpoint detection device comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the streaming voice endpoint detection device to perform the voice endpoint detection method of any one of claims 1 to 8.
CN202011543429.7A 2020-12-23 2020-12-23 Method, device and equipment for detecting streaming voice endpoint Active CN112652296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011543429.7A CN112652296B (en) 2020-12-23 2020-12-23 Method, device and equipment for detecting streaming voice endpoint


Publications (2)

Publication Number Publication Date
CN112652296A 2021-04-13
CN112652296B 2023-07-04

Family

ID=75360028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011543429.7A Active CN112652296B (en) 2020-12-23 2020-12-23 Method, device and equipment for detecting streaming voice endpoint

Country Status (1)

Country Link
CN (1) CN112652296B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001086633A1 (en) * 2000-05-10 2001-11-15 Multimedia Technologies Institute - Mti S.R.L. Voice activity detection and end-point detection
CN107919116A (en) * 2016-10-11 2018-04-17 芋头科技(杭州)有限公司 Voice activation detection method and device
CN108109633A (en) * 2017-12-20 2018-06-01 北京声智科技有限公司 System and method for unattended cloud speech-corpus collection and intelligent product testing
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 Voice endpoint detection method and speech recognition method
CN109119070A (en) * 2018-10-19 2019-01-01 科大讯飞股份有限公司 Voice endpoint detection method, apparatus, device and storage medium
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition method based on a modified CLDNN structure
US20200104709A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Stacked convolutional long short-term memory for model-free reinforcement learning
WO2020199990A1 (en) * 2019-03-29 2020-10-08 Goodix Technology (Hk) Company Limited Speech processing system and method therefor
CN111816216A (en) * 2020-08-25 2020-10-23 苏州思必驰信息科技有限公司 Voice activity detection method and device
CN111860386A (en) * 2020-07-27 2020-10-30 山东大学 Video semantic segmentation method based on ConvLSTM convolutional neural network


Also Published As

Publication number Publication date
CN112652296B (en) 2023-07-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant