CN113345423B - Voice endpoint detection method, device, electronic equipment and storage medium - Google Patents

Voice endpoint detection method, device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113345423B
CN113345423B (application CN202110705850.1A)
Authority
CN
China
Prior art keywords
voice
voice frame
frame
fusion
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110705850.1A
Other languages
Chinese (zh)
Other versions
CN113345423A (en)
Inventor
王庆然
万根顺
高建清
刘聪
王智国
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC, iFlytek Co Ltd filed Critical University of Science and Technology of China USTC
Priority to CN202110705850.1A
Publication of CN113345423A
Application granted
Publication of CN113345423B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides a voice endpoint detection method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring the voice feature and the acoustic state posterior feature of each voice frame in a voice data stream; fusing the voice feature and the acoustic state posterior feature of each voice frame to obtain a semantic fusion feature of each voice frame; and performing voice endpoint detection on the voice data stream based on the semantic fusion feature of each voice frame. By fusing the voice feature and the acoustic state posterior feature of each voice frame for endpoint detection, the method, apparatus, electronic device, and storage medium improve the interference resistance of voice endpoint detection, filter out voice segments that carry no specific semantics or irrelevant semantics, and avoid premature interruption of the human-computer interaction process caused by false triggering. Applying the semantic information carried in the acoustic state posterior feature directly greatly reduces the amount of computation and satisfies the real-time and low-latency requirements of endpoint detection.

Description

Voice endpoint detection method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a method and apparatus for detecting a voice endpoint, an electronic device, and a storage medium.
Background
To implement voice-based human-computer interaction functionality, voice endpoints in a segment of voice are typically identified by voice endpoint detection (Voice Activity Detection, VAD) techniques, whereby a valid segment of voice is obtained to perform subsequent operations.
Compared with conventional VAD, VAD in a man-machine conversation scene faces an additional difficulty: it must not only filter out noise unrelated to human voice more accurately, but also, based on the semantic content of the user's answer, filter out answer content whose semantics are unclear or irrelevant to the current scene, and make no response to such content.
Current VAD techniques can only distinguish human voice from non-human voice and cannot analyze the semantic information contained in the voice; they may therefore pass large amounts of meaningless content into subsequent voice processing, increasing system latency and unnecessary power consumption. In addition, if the user pauses while speaking, current VAD cannot judge whether the user has finished expressing a complete idea, and the interaction may be cut off prematurely, degrading the interaction experience.
Disclosure of Invention
The invention provides a voice endpoint detection method and apparatus, an electronic device, and a storage medium, to solve the problems in the prior art that voice endpoint detection can only distinguish human voice from non-human voice, which leads to increased latency and power consumption and to the interaction being interrupted prematurely.
The invention provides a voice endpoint detection method, which comprises the following steps:
acquiring voice characteristics and acoustic state posterior characteristics of each voice frame in a voice data stream;
fusing the voice characteristics of each voice frame and the acoustic state posterior characteristics to obtain semantic fusion characteristics of each voice frame;
and detecting the voice endpoint of the voice data stream based on the semantic fusion characteristics of each voice frame.
According to the method for detecting a voice endpoint provided by the invention, the method for acquiring the voice characteristics and the acoustic state posterior characteristics of each voice frame in the voice data stream comprises the following steps:
taking any voice frame in the voice data stream as a center, extracting a voice frame sequence with a preset length from the voice data stream as a reference sequence of any voice frame;
and determining the voice characteristics and the acoustic state posterior characteristics of any voice frame based on the reference sequence of the voice frame.
According to the voice endpoint detection method provided by the invention, the fusion of the voice characteristics of each voice frame and the posterior characteristics of the acoustic state is carried out to obtain the semantic fusion characteristics of each voice frame, and the voice endpoint detection method comprises the following steps:
based on a compression encoder, fusing and compressing the voice characteristics of each voice frame and the posterior acoustic state characteristics to obtain semantic fusion characteristics of each voice frame;
The compression encoder is trained in conjunction with a decoder for recovering features compressed by the compression encoder.
According to the voice endpoint detection method provided by the invention, the compression encoder is determined based on the following steps:
determining an initial model comprising an encoder and a decoder connected by an attention mechanism;
and training the initial model with the goal that the sample characteristics input into the initial model are consistent with the restored characteristics output by the initial model, and taking the encoder in the trained initial model as the compression encoder.
According to the voice endpoint detection method provided by the invention, the voice endpoint detection is carried out on the voice data stream based on the semantic fusion characteristics of each voice frame, and the voice endpoint detection method comprises the following steps:
determining a silence detection result of each voice frame based on the semantic fusion characteristics of each voice frame and the semantic fusion characteristics of the front voice frame and the rear voice frame of each voice frame;
and determining a voice endpoint detection result of the voice data stream based on the silence detection result of each voice frame.
According to the voice endpoint detection method provided by the invention, the silence detection result of each voice frame is determined based on the semantic fusion characteristics of each voice frame and the semantic fusion characteristics of the front voice frame and the rear voice frame of each voice frame, and the method comprises the following steps:
Based on semantic fusion characteristics of each voice frame, respectively performing silence detection on each voice frame to obtain initial detection probability of each voice frame;
and determining a silence detection result of any voice frame based on the initial detection probability of the voice frame and the voice frames before and after the voice frame and a fusion weight, wherein the fusion weight is determined based on the time interval between the corresponding voice frame and the any voice frame.
According to the voice endpoint detection method provided by the invention, based on the semantic fusion characteristics of each voice frame, silence detection is carried out on each voice frame to obtain the initial detection probability of each voice frame, and the voice endpoint detection method comprises the following steps:
performing multi-head attention conversion on semantic fusion characteristics of any voice frame to obtain hidden layer characteristics of any voice frame;
and carrying out silence detection on any voice frame based on the hidden layer characteristics of the any voice frame to obtain initial detection probability of the any voice frame.
The invention also provides a voice endpoint detection device, which comprises:
the feature extraction unit is used for obtaining the voice features and the acoustic state posterior features of each voice frame in the voice data stream;
the feature fusion unit is used for fusing the voice features of each voice frame and the acoustic state posterior features to obtain semantic fusion features of each voice frame;
And the end point detection unit is used for detecting the voice end point of the voice data stream based on the semantic fusion characteristics of each voice frame.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the above-mentioned voice endpoint detection methods when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech endpoint detection method as described in any of the above.
According to the voice endpoint detection method and apparatus, the electronic device, and the storage medium, voice endpoint detection is performed by fusing the speech feature and the acoustic state posterior feature of each speech frame, which improves the interference resistance of voice endpoint detection, filters out voice segments that carry no specific semantics or irrelevant semantics, and avoids premature interruption of the human-computer interaction process caused by false triggering. Compared with methods that obtain the transcribed text through decoding and search and then extract semantic features from it, the amount of computation is greatly reduced, and the real-time and low-latency requirements of endpoint detection are met.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings can be obtained from them without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for detecting a voice endpoint according to the present invention;
fig. 2 is a flowchart illustrating step 110 in the voice endpoint detection method according to the present invention;
FIG. 3 is a flow chart of a method for determining a compression encoder according to the present invention;
FIG. 4 is a schematic view of the structure of the initial model provided by the present invention;
FIG. 5 is a flowchart illustrating a step 130 in the method for detecting a voice endpoint according to the present invention;
FIG. 6 is a flowchart illustrating a method for detecting a voice endpoint according to the present invention;
FIG. 7 is a schematic diagram of a voice endpoint detection apparatus according to the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, voice-based human-computer interaction generally first detects voice segments and then performs semantic understanding. The process can be divided into three steps: detecting the valid voice segments in which the user is speaking, extracting semantic information of the user's speech content from the valid voice segments, and composing the answer content according to the semantic information.
The first step, detecting the valid speech segments in which the user is speaking, is currently realized by general-purpose VAD technology, which screens out the portions of the speech data in which the user is actually speaking and removes noise such as environmental noise. Each time the user speaks a sentence, the conversation system extracts one valid speech segment. In addition, VAD enables session segmentation: since the conversation system itself cannot judge when the user has finished speaking and the answer voice should be invoked, the interaction logic currently used is to conclude that the user has finished speaking as soon as the tail endpoint of a valid speech segment is detected, and then enter the semantic understanding and subsequent answering processes.
However, general-purpose VAD can only detect voice/non-voice and cannot analyze the semantic information contained in the voice, so its resistance to environmental interference is weak. When there is environmental noise (such as the sound of tapping a table or electrical hum) or a nearby person is speaking (side-channel voice), the VAD result may be abnormal for two reasons. First, environmental noise containing no human voice, or artificial noise (such as laughter or coughing), is misjudged as normal speech content, so the interaction process is interrupted prematurely and a voice segment with no actual content is returned. Second, meaningless speech followed by a long pause is cut out and returned, such as strings of filler words and pauses, or content irrelevant to the answer; such speech provides no useful semantic information to the conversation system, yet it interrupts the interaction prematurely and makes it hard for the system to obtain what the user actually said. Because general-purpose VAD is so prone to such abnormalities, the interaction logic in the session system is falsely triggered with high probability, making the session system extremely unstable and the user experience poor.
In addition, if the user pauses while speaking, for example pauses after saying "I want to call..." while thinking about what to say next, general-purpose VAD will already have ended the session and returned the speech segment. The back-end system then cannot determine whom the user wants to call; the key information the user has not yet uttered is missed, and the premature interruption of the interaction prevents the back-end system from obtaining valid semantic information.
In order to reduce the probability of false triggering and avoid interrupting the interaction prematurely, semantic understanding techniques could be introduced into the interaction logic described above. However, introducing semantic understanding into the interaction logic increases the latency of the conversation system: after the user finishes speaking, a long pause occurs before the system responds, which conflicts with the real-time requirements of the conversation system. How to improve VAD so that it better fits man-machine conversation scenarios, avoiding premature interruption of the interaction due to false triggering while still guaranteeing real-time performance, therefore remains an urgent problem in the field of human-computer interaction.
Fig. 1 is a schematic flow chart of the voice endpoint detection method provided by the invention. As shown in Fig. 1, the method can be applied to common speech recognition scenarios such as conference transcription and intelligent customer service, as well as to dialog scenarios that require real-time semantic understanding and have strict requirements on noise-induced false triggering. The method comprises the following steps:
Step 110, obtaining the voice characteristics and the acoustic state posterior characteristics of each voice frame in the voice data stream.
Here, the voice data stream is a data stream obtained by recording in real time, and the real-time recording may be voice recording or video recording, which is not particularly limited in the embodiment of the present invention.
In the process of recording the voice data stream in real time, feature extraction can be performed on each voice frame in the recorded voice data stream, wherein the feature extraction specifically comprises two aspects:
one of the aspects is the extraction of speech features of speech frames that are typically used for speech end point detection, which reflect information in acoustic aspects, such as sound intensity, loudness, pitch, etc., which can intuitively reflect whether the corresponding speech frame is a muted speech frame or an active speech frame.
The other aspect is the extraction of the posterior feature of the acoustic state of the speech frame for speech recognition, where the posterior feature of the acoustic state reflects the information about the semantic aspect, and may specifically include the acoustic state corresponding to the speech frame, and may also include the probability of the acoustic state corresponding to the speech frame, or the probability distribution of the acoustic state corresponding to each candidate of the speech frame, and so on.
In a common speech recognition flow, the acoustic state posterior feature is only an intermediate result: after it is obtained, a decoding search is still required to obtain the transcribed text of the recognition, and feature extraction must then be performed on that text to obtain semantic information. By applying the semantic information carried in the acoustic state posterior feature directly, the embodiment of the invention skips the decoding search and the text-side feature extraction, which greatly reduces the amount of computation and helps meet the real-time and low-latency requirements of endpoint detection.
In addition, directly applying the semantic information contained in the acoustic state posterior feature filters out channel differences between audio sources, which widens the sources of sample data available for acquiring semantic information: telephone call data, conference data, voice input method data, and the like can all be used in the training process for semantic information extraction.
And 120, fusing the voice characteristics of each voice frame and the acoustic state posterior characteristics to obtain semantic fusion characteristics of each voice frame.
Specifically, after the speech feature and the acoustic state posterior feature of each speech frame are obtained, the two features of each speech frame may be fused. The fusion may be a direct addition of the speech feature and the acoustic state posterior feature, a weighted sum according to preset weights, concatenation followed by feature compression, or feature compression of each followed by concatenation.
This yields the semantic fusion feature of each speech frame, i.e. a fused feature that contains both the acoustic and the semantic information of the corresponding speech frame.
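As an illustration only, the sketch below shows the fusion options just mentioned (addition, weighted sum, concatenation plus compression) on per-frame feature vectors. The feature dimensions follow the T×512 and T×9004 dimensions given later in the description; the projection and compression layers and the 0.6/0.4 weights are assumptions, not part of the patent.

```python
import torch
import torch.nn as nn

# Hypothetical per-frame features: speech feature (dim 512) and
# acoustic state posterior feature (dim 9004); shapes are assumptions.
speech_feat = torch.randn(1, 512)
posterior_feat = torch.randn(1, 9004)

# Option 1: direct addition / weighted sum requires equal dimensions,
# so project the posterior first (the projection layer is an assumption).
proj = nn.Linear(9004, 512)
fused_add = speech_feat + proj(posterior_feat)
fused_weighted = 0.6 * speech_feat + 0.4 * proj(posterior_feat)  # preset weights

# Option 2: concatenate, then compress to a fused representation.
compress = nn.Linear(512 + 9004, 256)   # stand-in for the compression encoder
fused_compressed = compress(torch.cat([speech_feat, posterior_feat], dim=-1))
```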
And 130, detecting voice end points of the voice data stream based on the semantic fusion characteristics of each voice frame.
Specifically, once the semantic fusion feature of each speech frame has been obtained, voice endpoint detection can be performed on that basis. Because the semantic fusion features used for endpoint detection contain both acoustic and semantic information, both kinds of information are taken into account during detection, which strengthens the interference resistance of voice endpoint detection.
Further, voice endpoint detection is performed on the voice data stream, for example, silence detection can be performed on each voice frame according to semantic fusion characteristics of each voice frame, so that the type of each voice frame, namely, the silence voice frame or the active voice frame, is judged, and the voice endpoint of the voice data stream is determined on the basis; for another example, a voice frame sequence may be sequentially selected from the voice data stream according to a preset sliding window, so as to determine whether a voice endpoint exists in the voice frame sequence and locate the voice endpoint based on the semantic fusion feature of each voice frame in the voice frame sequence.
According to the method provided by the embodiment of the invention, voice endpoint detection is performed by fusing the speech feature and the acoustic state posterior feature of each speech frame, which improves the interference resistance of voice endpoint detection, filters out voice segments that carry no specific semantics or irrelevant semantics, and avoids premature interruption of the human-computer interaction process caused by false triggering. Compared with methods that obtain the transcribed text through decoding and search and then extract semantic features from it, the amount of computation is greatly reduced, and the real-time and low-latency requirements of endpoint detection are met.
Based on any of the above embodiments, fig. 2 is a flowchart of step 110 in the voice endpoint detection method provided by the present invention, and as shown in fig. 2, step 110 includes:
step 111, taking any voice frame in the voice data stream as the center, extracting the voice frame sequence with preset length from the voice data stream as the reference sequence of the voice frame.
Specifically, the voice data stream is inherently sequential, and voice endpoint detection on it is a strongly time-dependent task. Therefore, when the speech feature and the acoustic state posterior feature are extracted frame by frame, the information of a single speech frame alone is not enough; the features of each speech frame need to be mined by also taking the frames before and after it into account.
For any speech frame whose speech feature and acoustic state posterior feature are to be extracted, the frame is expanded forward and backward in the voice data stream, with that frame at the center, to form its reference sequence, which assists the feature extraction of that frame.
Here, the reference sequence of any speech frame includes the frame itself, and the frame sits at the center of the sequence. For example, for the mth speech frame, the speech frame sequence from the (m-w)th to the (m+w)th frame in the voice data stream may be extracted and used as the reference sequence of the mth frame. The length of the reference sequence is the preset length 2w+1, where w is a positive integer.
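A minimal sketch of the centered window extraction described above. Edge padding by repeating boundary frames is an assumption made here to keep the window length at 2w+1 near the ends of the stream; it is not something stated in the patent.

```python
import numpy as np

def reference_sequence(frames: np.ndarray, m: int, w: int) -> np.ndarray:
    """Return the 2w+1 frames centered on frame m.

    frames: array of shape (num_frames, feat_dim); m: center index.
    Boundary handling by edge repetition is an assumption.
    """
    idx = np.clip(np.arange(m - w, m + w + 1), 0, len(frames) - 1)
    return frames[idx]

# Example: 100 frames of 40-dim acoustic features (shapes are illustrative).
stream = np.random.randn(100, 40)
ref = reference_sequence(stream, m=50, w=5)   # shape (11, 40)
```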
Step 112, determining a speech feature and an acoustic state posterior feature of the speech frame based on the reference sequence of the speech frame.
Specifically, extracting the speech feature and the acoustic state posterior feature of a speech frame based on its reference sequence makes use of both historical and future information, so the extracted features better match the acoustic and semantic information that the voice data stream carries around that frame, which further improves the reliability of voice endpoint detection.
Based on any of the above embodiments, in step 112, the determining the speech feature of the speech frame based on the reference sequence of the speech frame may be specifically implemented as follows:
the speech feature extracted from any speech frame can be implemented by a commonly used frame-level speech endpoint detection model, where the frame-level speech endpoint detection model is in consideration of time sequence, and structures such as Long Short-Term Memory (LSTM) and recurrent neural network (Recurrent Neural Network, RNN) are generally used. For example, the extraction of the voice features may adopt a structure of CNN (Convolutional Neural Networks, convolutional neural network) +lstm, and the acoustic features of each voice frame in the reference sequence of the voice frame may be first obtained, and the acoustic features are input into the structure of cnn+lstm to obtain the voice features of the voice frame. Here, the extraction of acoustic features may be obtained by a Filter Bank or MFCC (Mel-scale Frequency Cepstral Coefficients, mel-frequency cepstral coefficient) feature.
In addition, when speech feature extraction is implemented with a commonly used frame-level speech endpoint detection model, the model may first be trained on a large amount of sample audio, and the portion of it that extracts speech features of speech frames is then applied in step 112. For example, the part of the speech endpoint detection model that outputs an intermediate hidden layer vector for the input audio may be reused as the speech feature extractor.
Based on any of the above embodiments, in step 112, the acoustic state posterior feature of a speech frame may be determined from its reference sequence by a pre-trained acoustic model: the reference sequence is input into the acoustic model, and the posterior probabilities produced by the acoustic model during decoding are used as the acoustic state posterior feature of the frame, expressing its semantic information.
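A minimal sketch of this step, assuming the pre-trained acoustic model is any module that maps a reference sequence to per-state logits. The stand-in model and the number of acoustic states (9004, taken from the feature dimension mentioned later in the description) are assumptions.

```python
import torch
import torch.nn as nn

NUM_ACOUSTIC_STATES = 9004   # assumed from the T x 9004 dimension mentioned below

# Stand-in for a pre-trained acoustic model (frozen during endpoint detection).
acoustic_model = nn.Sequential(nn.Linear(40, 1024), nn.ReLU(),
                               nn.Linear(1024, NUM_ACOUSTIC_STATES))

ref_seq = torch.randn(11, 40)                  # reference sequence, 2w+1 = 11 frames
with torch.no_grad():
    logits = acoustic_model(ref_seq)           # (11, 9004)
    posterior = torch.softmax(logits, dim=-1)  # acoustic state posterior per frame
center_posterior = posterior[ref_seq.size(0) // 2]   # feature for the center frame
```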
Based on any of the above embodiments, step 120 includes:
based on a compression encoder, fusing and compressing the voice characteristics of each voice frame and the posterior acoustic state characteristics to obtain semantic fusion characteristics of each voice frame;
the compression encoder is trained in conjunction with a decoder for recovering the compressed features of the compression encoder.
Specifically, for any speech frame, fusing its speech feature and acoustic state posterior feature must not only keep the information carried by both features, but also prevent part of the resulting semantic fusion feature from being selectively ignored in subsequent tasks. If the two features are simply concatenated, mechanisms used later in voice endpoint detection, such as attention mechanisms, may learn to attend only to the speech feature on the left of the semantic fusion feature and ignore the acoustic state posterior feature concatenated on the right, so the semantic information would not actually be referenced during detection and the final endpoint detection result would suffer.
To address this problem, the fusion is performed by a compression encoder in the embodiment of the present invention. Here, the compression encoder is used for compressing the input features, and the speech features and the acoustic state posterior features may be input to the compression encoder as one feature after being spliced, or may be input to the compression encoder as two features, and the features are compressed and fused by the compression encoder, which is not particularly limited in the embodiment of the present invention.
Further, the compression encoder needs to ensure that the semantic fusion feature obtained by fusion does not miss information in the voice feature and the acoustic state posterior feature while realizing sufficient fusion of the voice feature and the acoustic state posterior feature. Thus, in the embodiment of the invention, the compression encoder is obtained by applying the combined training of the compression encoder and the decoder. Wherein the decoder assumes the task of restoring the features compressed by the compression encoder.
For example, the compression encoder compresses the input feature a into a', and the decoder restores the compressed feature by decoding a' into a'', where a'' is expected to be as close as possible to the original feature a. Through the joint training of the compression encoder and the decoder, the compression encoder performs feature compression while ensuring, as far as possible, that the compressed semantic fusion feature loses none of the information in the speech feature and the acoustic state posterior feature before compression, thereby preserving the integrity of the information.
According to the method provided by the embodiment of the invention, the compression encoder obtained through combined training with the decoder fuses and compresses the voice characteristics and the acoustic state posterior characteristics of each voice frame, so that the semantic fusion characteristics of each voice frame obtained are more implicitly abstracted while all information is contained, and the reliability of the subsequent voice endpoint recognition is improved.
Based on any of the above embodiments, fig. 3 is a flow chart of a method for determining a compression encoder according to the present invention, and as shown in fig. 3, the compression encoder is determined based on the following steps:
at step 310, an initial model is determined, the initial model including an encoder and a decoder connected by an attention mechanism.
Specifically, the joint training of the compression encoder and the decoder can borrow from the ideas of image compression and restoration, or of text semantic extraction in the field of natural language processing. Before training, an initial model must first be constructed for the task of vector compression and restoration. Fig. 4 is a schematic structural diagram of the initial model provided by the present invention; as shown in Fig. 4, the initial model includes two parts, an encoder and a decoder, where the output of the encoder can be connected directly to the input of the decoder.
Preferably, to further enhance the ability of the initial model to perform vector compression and decoding recovery, the encoder and decoder may be connected by an attention mechanism, i.e. the output of the encoder is followed by an attention module, which is then connected to the input of the decoder. The addition of the attention mechanism makes the subsequent tasks more complex, so that the initial model obtained by training is more robust.
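A minimal sketch, under assumed dimensions, of an initial model with an encoder and a decoder connected through an attention module, as in Fig. 4. The specific layers, hidden size, and head count are illustrative assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class InitialModel(nn.Module):
    """Encoder -> attention -> decoder, trained to reproduce its input (sketch)."""

    def __init__(self, in_dim: int = 512 + 9004, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                               batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim))

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, in_dim), concatenated speech + posterior features
        z = self.encoder(x)                    # compressed representation
        ctx, _ = self.attention(z, z, z)       # attention between encoder and decoder
        restored = self.decoder(ctx)           # restored feature
        return z, restored
```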
Step 320, training the initial model with the goal that the sample characteristics input into the initial model are consistent with the restored characteristics output from the initial model, and taking the encoder in the trained initial model as a compression encoder.
Specifically, after the initial model is obtained, it can be trained. During training, sample features are input into the initial model; the encoder compresses them, and the decoder decodes the compressed representation and outputs restored features. The encoder aims to abstract the input features completely into a higher-level abstract vector, and whether the information contained in that abstract vector is complete determines whether the decoder can restore it fully and correctly back to the input features. Therefore, during training, the consistency between the input sample features and the output restored features serves as the criterion for judging whether the encoder loses information during compression, and the initial model is trained with the goal of keeping the sample features input into it consistent with the restored features it outputs.
The encoder in the trained initial model can thus meet the requirement of lossless compression, so it can be used directly as the compression encoder for fusing and compressing the speech feature and the acoustic state posterior feature of each speech frame.
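A sketch of the joint training objective, assuming a mean-squared reconstruction loss as the measure of consistency between input sample features and restored features (the patent states the goal but not a specific loss), and reusing the hypothetical InitialModel sketched above.

```python
import torch

model = InitialModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Hypothetical batch of sample features: concatenated speech + posterior features.
sample = torch.randn(8, 11, 512 + 9004)

for _ in range(10):                      # illustrative number of training steps
    _, restored = model(sample)
    loss = torch.nn.functional.mse_loss(restored, sample)  # input vs. restored
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, only the encoder is kept as the compression encoder.
compression_encoder = model.encoder
```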
Based on any of the above embodiments, fig. 5 is a flowchart of step 130 in the voice endpoint detection method provided by the present invention, and as shown in fig. 5, step 130 includes:
step 131, determining a silence detection result of each voice frame based on the semantic fusion feature of each voice frame and the semantic fusion features of the front voice frame and the rear voice frame of each voice frame.
Specifically, since voice endpoint detection is a strongly time-dependent task, silence detection based on the semantic fusion feature of a single speech frame alone cannot accurately yield the silence detection result for that frame. The embodiment of the invention therefore proposes that, when silence detection is performed on a single speech frame, both the semantic fusion feature of that frame and the semantic fusion features of the frames before and after it are considered.
Here, the preceding and following voice frames of a frame do not refer only to the single frame immediately before it and the single frame immediately after it, but to the frames within a preset span before and after it. For example, with a span of k frames, the preceding and following frames of the mth frame include the (m-k)th to (m-1)th forward frames and the (m+1)th to (m+k)th backward frames, where k may be 1 or any other positive integer.
To perform silence detection using each speech frame and its preceding and following frames, the semantic fusion features of the frame and its neighbors may be arranged into a sequence in time order and input into a pre-trained silence detection model, which outputs the silence detection result of the frame. Alternatively, a weight may be set for each preceding and following frame according to its time interval from the frame, the semantic fusion features weighted and fused accordingly, and the fusion result input into the pre-trained silence detection model to obtain the silence detection result of the frame.
Here, the silence detection result of the voice frame is used to reflect whether the corresponding voice frame belongs to a silence voice frame or an active voice frame.
Step 132, determining a voice endpoint detection result of the voice data stream based on the silence detection result of each voice frame.
Specifically, after the silence detection result of each speech frame is obtained, the durations of silent segments and active segments can be accumulated over consecutive silent or active frames, so as to perform voice endpoint detection on the voice data stream, determine the head and tail endpoints of any valid voice segment it contains, and output the valid segment for the subsequent conversation.
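As an illustration, here is a sketch of one possible frame-level decoding rule: accumulate runs of active and silent frames and emit head/tail endpoints once assumed minimum-duration thresholds are exceeded. The thresholds and the exact rule are assumptions; the patent only states that durations are accumulated.

```python
from typing import List, Tuple

def detect_endpoints(is_silent: List[bool],
                     min_speech: int = 10,
                     min_silence: int = 30) -> List[Tuple[int, int]]:
    """Return (head, tail) frame indices of valid voice segments (sketch)."""
    segments: List[Tuple[int, int]] = []
    head = None          # head endpoint of the segment currently being tracked
    silence_run = 0      # length of the current run of silent frames
    for i, silent in enumerate(is_silent):
        if not silent:
            if head is None:
                head = i                       # tentative head endpoint
            silence_run = 0
        elif head is not None:
            silence_run += 1
            # A long enough silence closes the segment: tail endpoint found.
            if silence_run >= min_silence:
                tail = i - silence_run + 1
                if tail - head >= min_speech:  # drop segments that are too short
                    segments.append((head, tail))
                head, silence_run = None, 0
    if head is not None and len(is_silent) - head >= min_speech:
        segments.append((head, len(is_silent)))
    return segments

# Example: 40 active frames followed by 60 silent frames -> one segment (0, 40).
print(detect_endpoints([False] * 40 + [True] * 60))
```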
Based on any of the above embodiments, step 131 includes:
based on semantic fusion characteristics of each voice frame, respectively performing silence detection on each voice frame to obtain initial detection probability of each voice frame;
and determining a silence detection result of the voice frame based on the initial detection probability of any voice frame and the voice frames before and after the voice frame and the fusion weight, wherein the fusion weight is determined based on the time interval between the corresponding voice frame and the voice frame.
Specifically, the silence detection is specific to a single voice frame, and the type of the voice frame can be judged according to the semantic fusion characteristic of any voice frame, namely whether the voice frame belongs to the silence voice frame or the active voice frame, and the initial detection probability of the voice frame is obtained. The initial detection probability here may include a probability that the speech frame is a mute speech frame, or includes a probability that the speech frame is an active speech frame, or further includes a probability that the speech frame is a mute speech frame and an active speech frame, respectively.
Considering the temporal nature of the voice data stream, the initial detection probabilities of the frames before and after a frame can be referred to when determining its silence detection result. Moreover, since abrupt changes in speech are rare and most changes are gradual over time, the smaller the time interval between a neighboring frame and the frame in question, the more similar their situations and the stronger the reference value of that neighbor; the larger the time interval, the weaker its reference value.
Therefore, the fusion weight of each neighboring frame can be determined by its time interval from the frame in question: since the time interval reflects the strength of the reference value, and the reference value maps directly to the size of the fusion weight, frames with a smaller time interval are given larger fusion weights.
On this basis, for any speech frame, the initial detection probabilities of the frame and of its preceding and following frames can be weighted and fused using their fusion weights to obtain the weighted fusion probability of the frame, from which the silence detection result is judged. For example, a decision threshold may be preset: if the weighted fusion probability exceeds the threshold, the frame is determined to be a silent frame, otherwise an active frame. The threshold may be set to 0.5 or 0.6, which is not specifically limited in the embodiment of the present invention.
Based on any of the above embodiments, based on the initial detection probability and the fusion weight of any speech frame and its preceding and following speech frames, the silence detection result of the speech frame is determined, and the following examples may be referred to:
Assuming that the speech frame in question is the mth speech frame, its preceding and following frames include the (m-k)th to (m-1)th forward frames and the (m+1)th to (m+k)th backward frames, and the weighted fusion of the initial detection probabilities can be performed as a weighted average, which may be represented by the following formula:

$$\bar{y}_m = \frac{\sum_{i=m-k}^{m+k} W_i\, y_i}{\sum_{i=m-k}^{m+k} W_i}$$

where $\bar{y}_m$ is the weighted fusion probability of the mth speech frame obtained by the weighted fusion, $y_{m-k}$ to $y_{m+k}$ are the initial detection probabilities of the mth speech frame and its preceding and following frames, and $W_{m-k}$ to $W_{m+k}$ are the fusion weights of the mth speech frame and its preceding and following frames.
The fusion weights $W_{m-k}$ to $W_{m+k}$ can be preset. For example, they may form an arithmetic progression whose value approaches 1 toward the mth frame and stays greater than 0 toward both ends. For instance, when m = 4 and k = 2, the fusion weights of the 2nd, 3rd, 4th, 5th, and 6th speech frames are 0.5, 0.75, 1, 0.75, and 0.5 respectively, with a common difference of 0.25 between adjacent frames.
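A small numeric sketch of the weighted average above, using the example weights (0.5, 0.75, 1, 0.75, 0.5) and made-up initial detection probabilities; the 0.5 decision threshold is one of the example values mentioned in the text.

```python
# Illustrative initial silence probabilities of frames m-2 .. m+2 (made-up values).
y = [0.2, 0.7, 0.9, 0.8, 0.3]
w = [0.5, 0.75, 1.0, 0.75, 0.5]          # arithmetic-progression fusion weights

fused = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
is_silent = fused > 0.5                   # example decision threshold from the text
print(round(fused, 3), is_silent)         # 0.65 True -> judged as a silent frame
```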
Based on any of the above embodiments, in step 131, silence detection is performed on each voice frame based on the semantic fusion feature of each voice frame, so as to obtain an initial detection probability of each voice frame, including:
Performing multi-head attention conversion on the semantic fusion characteristics of any voice frame to obtain hidden layer characteristics of the voice frame;
and performing silence detection on the voice frame based on the hidden layer characteristics of the voice frame to obtain the initial detection probability of the voice frame.
Specifically, when silence detection is performed on the semantic fusion feature of a single speech frame, an attention mechanism can be applied to highlight the more representative parts of the semantic fusion feature, so that deeper hidden layer features enabling more accurate silence detection are obtained. Here, the hidden layer features may be obtained through a self-attention mechanism, which may be represented by the following formula:
$$h_m = \mathrm{softmax}\!\left(\frac{x_m x_m^{\top}}{\sqrt{d_{x_m}}}\right) x_m$$

where $x_m$ is the semantic fusion feature of the mth speech frame, $d_{x_m}$ is the vector dimension of the matrix $x_m$, and $h_m$ is the hidden layer feature of the mth speech frame obtained by the self-attention mechanism.
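A minimal NumPy sketch of the scaled dot-product self-attention written above, applied to the semantic fusion features of a reference window around the mth frame; the window length and the 256-dim fusion feature are assumptions.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """softmax(x @ x.T / sqrt(d)) @ x for a (frames, d) matrix of fusion features."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ x

x_m = np.random.randn(11, 256)   # assumed: 11-frame window, 256-dim fusion features
h_m = self_attention(x_m)        # hidden layer features, same shape as x_m
```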
On this basis, in order to fully capture the information of the semantic fusion features in different subspaces, a multi-head attention mechanism can be applied to perform multiple parallel self-attention conversions, enriching the information contained in the hidden layer features and improving the reliability of the subsequent silence detection.
Further, in the multi-head attention mechanism, each head multiplies the input by its own random matrix $W_i$ and then performs self-attention conversion separately. The random matrices $W_i$ realize multiple linear transformations, yielding multiple linearly transformed semantic fusion features; self-attention conversion is performed on each of them, producing multiple self-attention values, i.e. multiple attention outputs, which are then normalized into one vector output. Assuming the normalization matrix is $W_z$, the hidden layer feature output of the mth speech frame can be expressed as:

$$H_m = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_n)\, W_z$$

where n is the number of parallel attention heads of the multi-head attention mechanism, and $\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_n$ are the respective attention values.
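A sketch of this multi-head variant under the same assumptions, reusing the self_attention function and x_m from the sketch above: each head applies its own random projection W_i before self-attention, and the concatenated heads are combined by a matrix W_z. The head count, scaling, and dimensions are illustrative.

```python
import numpy as np

def multi_head_attention(x: np.ndarray, n_heads: int = 4,
                         seed: int = 0) -> np.ndarray:
    """Concat(head_1 .. head_n) @ W_z with head_i = self_attention(x @ W_i)."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    heads = []
    for _ in range(n_heads):
        W_i = rng.standard_normal((d, d)) / np.sqrt(d)   # per-head random matrix
        heads.append(self_attention(x @ W_i))
    W_z = rng.standard_normal((n_heads * d, d)) / np.sqrt(n_heads * d)
    return np.concatenate(heads, axis=-1) @ W_z          # (frames, d)

H_m = multi_head_attention(x_m)   # reuses x_m and self_attention defined above
```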
Based on any of the above embodiments, fig. 6 is a flowchart of a voice endpoint detection method according to the present invention, and as shown in fig. 6, the voice endpoint detection method may include the following steps:
firstly, for a real-time recorded voice data stream, sequentially taking each voice frame in the voice data stream as a center, and extracting a voice frame sequence with a preset length from the voice data stream as a reference sequence of each voice frame.
Secondly, the reference sequence of each speech frame is input into the VAD time sequence modeling and the acoustic modeling respectively, so as to obtain the speech feature and the acoustic state posterior feature of each frame. The VAD time sequence modeling may be the part of a common speech endpoint detection model that outputs an intermediate hidden layer vector for the input audio, for example a CNN+LSTM structure, so the obtained speech feature may be a BN (Batch Normalization) feature with dimension T×512, where T is the reference sequence length. The acoustic modeling may use a general acoustic model, whose posterior probabilities during decoding serve as the acoustic state posterior feature of the speech frame, expressing its semantic information; the dimension of this feature may be T×9004.
Then, the speech feature and the acoustic state posterior feature of each speech frame are concatenated, and the concatenated feature of each frame is input into the compression encoder for vector compression, yielding the higher-level semantic fusion feature of each frame.
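A shape-level sketch tying the feature extraction and fusion steps together, assuming the T×512 and T×9004 dimensions mentioned above, a 40-dim FBank input, and a hypothetical 256-dimensional compressed fusion feature; the modules are stand-ins for the trained VAD time sequence model, acoustic model, and compression encoder.

```python
import torch
import torch.nn as nn

T = 11                                                      # reference sequence length (assumed)
vad_timing_model = nn.LSTM(40, 512, batch_first=True)       # stand-in, outputs T x 512
acoustic_model = nn.Linear(40, 9004)                        # stand-in, outputs T x 9004
compression_encoder = nn.Linear(512 + 9004, 256)            # stand-in trained encoder

ref_seq = torch.randn(1, T, 40)                             # FBank features (assumed dim 40)
speech_feat, _ = vad_timing_model(ref_seq)                  # (1, T, 512)
posterior = torch.softmax(acoustic_model(ref_seq), dim=-1)  # (1, T, 9004)

concat = torch.cat([speech_feat, posterior], dim=-1)        # (1, T, 9516)
semantic_fusion = compression_encoder(concat)               # (1, T, 256)
```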
Then, multi-head attention conversion may be performed separately on the semantic fusion feature of each speech frame, as outlined by the dashed boxes in Fig. 6, producing the self-attention values of the heads. Silence detection is then performed on the multi-head self-attention values of each frame to obtain its initial detection probability, and the initial detection probabilities of neighboring frames are integrated to obtain the silence detection result of each frame.
Finally, according to a preset frame-level decoding rule, the silence detection results of the frames are used to detect voice endpoints and obtain the endpoint detection result.
Based on any of the above embodiments, the flow of the voice endpoint detection method shown in fig. 6 may be implemented by an end-to-end model. In the end-to-end model, each step has a corresponding execution module.
For the VAD time sequence modeling part, a universal VAD task model can first be trained on a large amount of audio data, and the parameters of the part of that model which outputs the intermediate hidden layer vector for the input audio are then copied into the end-to-end model. The parameters of this part may continue to be updated in the subsequent training of the end-to-end model.
The acoustic modeling part can be obtained from an acoustic model pre-trained with word accuracy as the objective. After its parameters are copied into the end-to-end model, they are fixed and do not participate in updating during the subsequent training of the end-to-end model.
The vector compression part can be obtained by training the initial model of the encoder+decoder structure with the goal that the sample features input into the initial model are consistent with the restored features it outputs. The trained encoder can then be used for vector compression; its parameters are fixed and do not participate in updating during the end-to-end model training.
Both the multi-head attention part and the frame-level decoding model can be built directly into the end-to-end model, and their parameters are updated in the subsequent training of the end-to-end model.
Based on any of the above embodiments, fig. 7 is a schematic structural diagram of a voice endpoint detection apparatus according to the present invention, as shown in fig. 7, the apparatus includes:
a feature extraction unit 710, configured to obtain a speech feature and an acoustic state posterior feature of each speech frame in the speech data stream;
the feature fusion unit 720 is configured to fuse the speech feature of each speech frame with the acoustic state posterior feature to obtain a semantic fusion feature of each speech frame;
The endpoint detection unit 730 is configured to perform voice endpoint detection on the voice data stream based on the semantic fusion feature of each voice frame.
According to the apparatus provided by the embodiment of the invention, voice endpoint detection is performed by fusing the speech feature and the acoustic state posterior feature of each speech frame, which improves the interference resistance of voice endpoint detection, filters out voice segments that carry no specific semantics or irrelevant semantics, and avoids premature interruption of the human-computer interaction process caused by false triggering. Compared with methods that obtain the transcribed text through decoding and search and then extract semantic features from it, the amount of computation is greatly reduced, and the real-time and low-latency requirements of endpoint detection are met.
Based on any of the above embodiments, the feature extraction unit 710 is configured to:
taking any voice frame in the voice data stream as a center, extracting a voice frame sequence with a preset length from the voice data stream as a reference sequence of any voice frame;
and determining the voice characteristics and the acoustic state posterior characteristics of any voice frame based on the reference sequence of the voice frame.
Based on any of the above embodiments, the feature fusion unit 720 is configured to:
Based on a compression encoder, fusing and compressing the voice characteristics of each voice frame and the posterior acoustic state characteristics to obtain semantic fusion characteristics of each voice frame;
the compression encoder is trained in conjunction with a decoder for recovering features compressed by the compression encoder.
Based on any of the above embodiments, the apparatus further comprises an encoder construction unit for:
determining an initial model comprising an encoder and a decoder connected by an attention mechanism;
and training the initial model by taking the sample characteristics input into the initial model and the restoring characteristics output by the initial model as targets, and taking the encoder in the trained initial model as the compression encoder.
Based on any of the above embodiments, the endpoint detection unit 730 is configured to:
determining a silence detection result of each voice frame based on the semantic fusion characteristics of each voice frame and the semantic fusion characteristics of the front voice frame and the rear voice frame of each voice frame;
and determining a voice endpoint detection result of the voice data stream based on the silence detection result of each voice frame.
Based on any of the above embodiments, the endpoint detection unit 730 is configured to:
Based on semantic fusion characteristics of each voice frame, respectively performing silence detection on each voice frame to obtain initial detection probability of each voice frame;
and determining a silence detection result of any voice frame based on the initial detection probability of the voice frame and the voice frames before and after the voice frame and a fusion weight, wherein the fusion weight is determined based on the time interval between the corresponding voice frame and the any voice frame.
Based on any of the above embodiments, the endpoint detection unit 730 is configured to:
performing multi-head attention conversion on semantic fusion characteristics of any voice frame to obtain hidden layer characteristics of any voice frame;
and carrying out silence detection on any voice frame based on the hidden layer characteristics of the any voice frame to obtain initial detection probability of the any voice frame.
Fig. 8 illustrates a physical structure diagram of an electronic device, as shown in fig. 8, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. Processor 810 may invoke logic instructions in memory 830 to perform a voice endpoint detection method comprising: acquiring voice characteristics and acoustic state posterior characteristics of each voice frame in a voice data stream; fusing the voice characteristics of each voice frame and the acoustic state posterior characteristics to obtain semantic fusion characteristics of each voice frame; and detecting the voice endpoint of the voice data stream based on the semantic fusion characteristics of each voice frame.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the voice endpoint detection method provided above, the method comprising: acquiring the voice characteristics and the acoustic state posterior characteristics of each voice frame in a voice data stream; fusing the voice characteristics and the acoustic state posterior characteristics of each voice frame to obtain the semantic fusion characteristics of each voice frame; and performing voice endpoint detection on the voice data stream based on the semantic fusion characteristics of each voice frame.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the voice endpoint detection method provided above, the method comprising: acquiring the voice characteristics and the acoustic state posterior characteristics of each voice frame in a voice data stream; fusing the voice characteristics and the acoustic state posterior characteristics of each voice frame to obtain the semantic fusion characteristics of each voice frame; and performing voice endpoint detection on the voice data stream based on the semantic fusion characteristics of each voice frame.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for detecting a voice endpoint, comprising:
acquiring voice characteristics and acoustic state posterior characteristics of each voice frame in a voice data stream;
fusing the voice characteristics of each voice frame and the acoustic state posterior characteristics to obtain semantic fusion characteristics of each voice frame;
based on the semantic fusion characteristics of each voice frame, detecting voice end points of the voice data stream;
wherein the fusing of the voice characteristics of each voice frame and the acoustic state posterior characteristics to obtain the semantic fusion characteristics of each voice frame comprises:
based on a compression encoder, fusing and compressing the voice characteristics and the acoustic state posterior characteristics of each voice frame to obtain the semantic fusion characteristics of each voice frame;
the compression encoder is trained in combination with a decoder for recovering features compressed by the compression encoder;
wherein the detecting of voice endpoints of the voice data stream based on the semantic fusion characteristics of each voice frame comprises:
based on the semantic fusion characteristics of each voice frame, respectively performing silence detection on each voice frame to obtain the initial detection probability of each voice frame;
determining the silence detection result of any voice frame based on the initial detection probabilities of that voice frame and of the voice frames before and after it, together with fusion weights, wherein each fusion weight is determined by the time interval between the corresponding voice frame and that voice frame;
and determining the voice endpoint detection result of the voice data stream based on the silence detection results of the voice frames.
2. The voice endpoint detection method of claim 1, wherein the acquiring of the voice characteristics and the acoustic state posterior characteristics of each voice frame in the voice data stream comprises:
taking any voice frame in the voice data stream as a center, extracting a voice frame sequence of a preset length from the voice data stream as the reference sequence of that voice frame;
and determining the voice characteristics and the acoustic state posterior characteristics of that voice frame based on its reference sequence.
3. The voice endpoint detection method of claim 1, wherein the compression encoder is determined based on the steps of:
determining an initial model comprising an encoder and a decoder connected by an attention mechanism;
and training the initial model with the objective that the restored characteristics output by the initial model are consistent with the sample characteristics input into it, and taking the encoder in the trained initial model as the compression encoder.
4. The voice endpoint detection method of claim 1, wherein the performing silence detection on each voice frame based on the semantic fusion characteristics of each voice frame to obtain the initial detection probability of each voice frame comprises:
performing multi-head attention conversion on the semantic fusion characteristics of any voice frame to obtain the hidden layer characteristics of that voice frame;
and performing silence detection on that voice frame based on its hidden layer characteristics to obtain the initial detection probability of that voice frame.
5. A voice endpoint detection apparatus, comprising:
the feature extraction unit is used for obtaining the voice features and the acoustic state posterior features of each voice frame in the voice data stream;
the feature fusion unit is used for fusing the voice features of each voice frame and the acoustic state posterior features to obtain semantic fusion features of each voice frame;
the end point detection unit is used for detecting the voice end point of the voice data stream based on the semantic fusion characteristics of each voice frame;
the feature fusion unit is specifically used for:
based on a compression encoder, fusing and compressing the voice characteristics and the acoustic state posterior characteristics of each voice frame to obtain the semantic fusion characteristics of each voice frame;
the compression encoder is trained in combination with a decoder for recovering features compressed by the compression encoder;
the end point detection unit is specifically configured to:
based on the semantic fusion characteristics of each voice frame, respectively performing silence detection on each voice frame to obtain the initial detection probability of each voice frame;
determining the silence detection result of any voice frame based on the initial detection probabilities of that voice frame and of the voice frames before and after it, together with fusion weights, wherein each fusion weight is determined by the time interval between the corresponding voice frame and that voice frame;
and determining a voice endpoint detection result of the voice data stream based on the silence detection result of each voice frame.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the speech endpoint detection method of any of claims 1 to 4 when the program is executed.
7. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the speech end point detection method according to any of claims 1 to 4.
CN202110705850.1A 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium Active CN113345423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110705850.1A CN113345423B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110705850.1A CN113345423B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113345423A CN113345423A (en) 2021-09-03
CN113345423B true CN113345423B (en) 2024-02-13

Family

ID=77478498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110705850.1A Active CN113345423B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113345423B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240029719A1 (en) * 2022-07-21 2024-01-25 Google Llc Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080189109A1 (en) * 2007-02-05 2008-08-07 Microsoft Corporation Segmentation posterior based boundary point determination
US9997172B2 (en) * 2013-12-02 2018-06-12 Nuance Communications, Inc. Voice activity detection (VAD) for a coded speech bitstream without decoding
US10186254B2 (en) * 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101625857A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN107045870A (en) * 2017-05-23 2017-08-15 南京理工大学 A voice endpoint detection method based on eigenvalue coding
CN108538310A (en) * 2018-03-28 2018-09-14 天津大学 A voice endpoint detection method based on long-term signal power spectrum variation
JP2020086434A (en) * 2018-11-29 2020-06-04 コリア アドバンスド インスティチュート オブ サイエンス アンド テクノロジィ Noise-removal variational autoencoder-based integrated training method and device for voice detection
CN110047470A (en) * 2019-04-11 2019-07-23 深圳市壹鸽科技有限公司 A voice endpoint detection method
CN110136749A (en) * 2019-06-14 2019-08-16 苏州思必驰信息科技有限公司 Speaker-dependent end-to-end voice endpoint detection method and device
CN111181574A (en) * 2019-12-30 2020-05-19 浪潮(北京)电子信息产业有限公司 End point detection method, device and equipment based on multi-layer feature fusion
CN111325320A (en) * 2020-02-10 2020-06-23 深圳前海微众银行股份有限公司 Weak supervision machine learning optimization method, device, equipment and storage medium
CN111341351A (en) * 2020-02-25 2020-06-26 厦门亿联网络技术股份有限公司 Voice activity detection method and device based on self-attention mechanism and storage medium
CN111881292A (en) * 2020-06-30 2020-11-03 腾讯科技(深圳)有限公司 Text classification method and device
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression
CN111858973A (en) * 2020-07-30 2020-10-30 北京达佳互联信息技术有限公司 Multimedia event information detection method, device, server and storage medium
CN112331183A (en) * 2020-10-27 2021-02-05 中科极限元(杭州)智能科技股份有限公司 Non-parallel corpus voice conversion method and system based on autoregressive network
CN112397093A (en) * 2020-12-04 2021-02-23 中国联合网络通信集团有限公司 Voice detection method and device
CN112967739A (en) * 2021-02-26 2021-06-15 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A voice endpoint detection method based on affinity propagation clustering; Lin Qin; Tu Zhengzheng; Wang Qingwei; Guo Yutang; Journal of Anhui University (Natural Science Edition), No. 3; full text *
A short text clustering method based on LSTM autoencoder; Huang Jian; Deng Meiling; Computing Technology and Automation, No. 3; full text *

Also Published As

Publication number Publication date
CN113345423A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
CN111508498B (en) Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN112102850B (en) Emotion recognition processing method and device, medium and electronic equipment
CN109712612B (en) Voice keyword detection method and device
KR20170139650A (en) Method for adding accounts, terminals, servers, and computer storage media
CN111640456B (en) Method, device and equipment for detecting overlapping sound
CN110827795A (en) Voice input end judgment method, device, equipment, system and storage medium
CN113345473B (en) Voice endpoint detection method, device, electronic equipment and storage medium
WO2015103836A1 (en) Voice control method and device
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
CN111797632A (en) Information processing method and device and electronic equipment
CN111951796A (en) Voice recognition method and device, electronic equipment and storage medium
CN111816216A (en) Voice activity detection method and device
EP2763136B1 (en) Method and system for obtaining relevant information from a voice communication
CN112908301A (en) Voice recognition method, device, storage medium and equipment
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN113192535A (en) Voice keyword retrieval method, system and electronic device
CN112735385A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
CN116312552A (en) Video speaker journaling method and system
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
CN107886940B (en) Voice translation processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20230512
Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui
Applicant after: University of Science and Technology of China
Applicant after: IFLYTEK Co.,Ltd.
Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui
Applicant before: IFLYTEK Co.,Ltd.
GR01 Patent grant