CN110136749B

CN110136749B - Method and device for detecting end-to-end voice endpoint related to speaker

Info

Publication number: CN110136749B
Application number: CN201910517374.3A
Authority: CN
Inventors: 俞凯; 钱彦旻; 陈烨斐; 王帅
Original assignee: Sipic Technology Co Ltd
Current assignee: AI Speech Ltd
Priority date: 2019-06-14
Filing date: 2019-06-14
Publication date: 2022-08-16
Anticipated expiration: 2039-06-14
Also published as: CN110136749A

Abstract

The invention discloses a method and a device for detecting an end-to-end voice endpoint related to a speaker, wherein the method for detecting the end-to-end voice endpoint related to the speaker comprises the following steps: extracting acoustic features of the voice to be detected; splicing the acoustic features and the i-vector features to serve as new input features; inputting the new input features into a neural network for training and outputting a detection result whether the voice to be detected is the voice of the target speaker. The method and the device of the application realize the end-to-end speaker-related end point detection system by adding speaker-related information (i-vector) in the training process of the traditional voice end point detection system and applying a Deep Neural Network (DNN) and a long-short time memory neural network (LSTM) to voice end point detection, and can directly output the voice part of a target speaker through a single network and remove other silent sections in audio and the voice of non-target speakers.

Description

Method and device for detecting end-to-end voice endpoint related to speaker

Technical Field

The invention belongs to the technical field of voice recognition, and particularly relates to an end-to-end voice endpoint method and device.

Background

In the related art, Voice Activity Detection (VAD) is a very important preprocessing step for tasks such as Voice recognition and speaker recognition. One underlying voice endpoint detection system aims to remove the silent portions of the audio, while a more general detection system can remove all irrelevant portions of the audio, including noise and the non-targeted speaker's voice.

The existing scheme mainly aims at scenes with background human voice interference (such as restaurants and the like), and provides a robust voice endpoint system which can extract the voice part of a target speaker. The system proposed by the related art is based on a Gaussian Mixture Model (GMM), and an additional GMM model is used to model the target speaker based on separate modeling of speech and noise, i.e., three GMMs are used to achieve the goal of extracting the speech part of the target speaker.

The inventor finds that the existing scheme has at least the following defects in the process of implementing the application:

the voice of the remaining non-targeted speakers is considered as background noise (the targeted speaker has significantly higher energy than the remaining speakers) and is not suitable for a multi-person conversation scenario. Secondly, when the complex environment is faced, the detection accuracy of the technology is obviously reduced.

Disclosure of Invention

Embodiments of the present invention provide a method and an apparatus for end-to-end voice endpoint related to a speaker, which are used to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a method for detecting an end-to-end voice endpoint related to a speaker, including: extracting acoustic features of the voice to be detected; splicing the acoustic features and i-vector features of the target speaker to serve as new input features; inputting the new input features into a neural network for training and outputting a prediction result of whether the voice to be detected is the voice of the target speaker.

In a second aspect, an embodiment of the present invention provides an end-to-end voice endpoint detection apparatus related to a speaker, including: the extraction module is configured to extract acoustic features of the voice to be detected; the splicing module is configured to splice the acoustic features and i-vector features of the target speaker to serve as new input features; and the output module is configured to input the new input features into a neural network for training and output a prediction result of whether the voice to be detected is the voice of the target speaker.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the speaker dependent end-to-end speech endpoint detection method of any of the embodiments of the present invention.

In a fourth aspect, the embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, and the computer program includes program instructions, when the program instructions are executed by a computer, the computer executes the steps of the speaker dependent end-to-end voice endpoint detection method according to any embodiment of the present invention.

The method and the device provided by the application extract corresponding distinctive features for different speakers, then add the features into the voice endpoint detection system, and improve the robustness of voice endpoint detection in a noise environment by using a deep learning method. Furthermore, the scheme of the application not only provides a new scheme for voice endpoint detection in a conversation scene, but also shows various possibilities of improving performance by using speaker-related features.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

Fig. 1 is a flowchart of a method for detecting a speaker-dependent end-to-end voice endpoint according to an embodiment of the present invention;

FIG. 2 is a LSTM-based speaker dependent VAD provided in accordance with an embodiment of the present invention;

FIG. 3 is a method for feature merging according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating predicted results of different system test cases according to an embodiment of the present invention;

FIG. 5 is a block diagram of an apparatus for end-to-end speaker-dependent voice endpoint detection according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a flowchart of an embodiment of a method for detecting an end-to-end voice endpoint related to a speaker according to the present application is shown, and the method for detecting an end-to-end voice endpoint related to a speaker according to the present embodiment can be applied to terminals with language models, such as smart voice tvs, smart speakers, smart dialogue toys, and other existing smart terminals with speaker detection functions.

As shown in fig. 1, in step 101, extracting acoustic features of a voice to be detected;

in step 102, the acoustic features are spliced with i-vector features of a target speaker to serve as new input features;

in step 103, the new input features are input into the neural network for training and outputting the prediction result of whether the speech to be detected is the speech of the target speaker.

In this embodiment, for step 101, the end-to-end voice endpoint detection apparatus related to the speaker first extracts the acoustic feature of the voice to be detected, then in step 102, the extracted acoustic feature is spliced with the i-vector feature which is extracted from the voice to be detected and can represent the identity of the voice to be detected, and the spliced feature is used as a new input feature. Then, in step 103, the new input feature is input into the neural network to train the neural network and output a prediction result of whether the speech to be detected is the speech of the speaker.

The scheme of the embodiment adds the i-vector characteristic of the information related to the speaker in the modeling process, and is an online detection system, so the system delay is low.

In some optional embodiments, the neural network is a deep neural network, inputting the new input features into the neural network for training, and outputting the prediction result of whether the speech to be detected is the speech of the target speaker includes: inputting each frame data in the new input features to the deep neural network respectively; and respectively outputting the detection result of whether each frame of data is the voice of the target speaker. So that it can be detected whether each frame data is the voice of the target speaker.

In some optional embodiments, the step of the neural network is a long-term and short-term memory cyclic neural network, and the step of inputting the new input features into the neural network for training and outputting the prediction result of whether the speech to be detected is the speech of the target speaker includes: inputting the whole sentence data corresponding to the new input features into the deep neural network; and outputting the prediction result of whether each frame of data is the voice of the target speaker. So that it can be detected whether the whole sentence data is the voice of the target speaker.

In some optional embodiments, before inputting the new input features into the neural network for training and outputting the prediction result of whether the speech to be detected is the speech of the target speaker, the method further includes: and combining n adjacent voice frames in the new input characteristics, then taking an average value as input, and simultaneously repeating the prediction result corresponding to each prediction output n times to form final output. Thus, by combining n adjacent speech frames in the feature input portion by taking an average, the obtained new feature is one n times the original length, which is done to enhance the continuity between the speech frames. Then after the model outputs the predicted values, each predicted value is repeated for n times, so that the length is consistent with the initially input characteristic length, and each frame is guaranteed to have corresponding predicted output. The method is used in speaker-dependent voice endpoint detection, and can solve the problem of erroneous conversion and fragmentation between voice and non-voice.

In some alternative embodiments, stitching the acoustic features with i-vector features of the targeted speaker as new input features includes: extracting i-vector characteristics of a target speaker from the voice to be detected by using a pre-trained i-vector extractor; the frame-level acoustic features and the i-vector features are concatenated as new inputs. Therefore, the extraction and the splicing of the i-vector characteristics are realized, and the voice of the target speaker is better identified.

The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and a specific embodiment of the finally determined solution.

The inventor discovers that in the process of implementing the application: the prior art regards the voice of an irrelevant speaker as background noise rather than normal conversation, so the voice of a target speaker cannot be extracted under the condition of similar energy, and secondly, the prior art mainly uses a traditional GMM method, does not use a deep learning method, so the system detection capability is limited.

In order to solve the defects in the prior art, the following schemes may be adopted by those skilled in the art: if the voice part of a specific Speaker is to be extracted, all voice sections in the audio are generally found out by using a common voice endpoint detection system, and then all voice sections are screened by using Speaker Verification (SV) to find out the voice part of a target Speaker. Such a two-stage solution is relatively easy to conceive.

The scheme of the application is an end-to-end neural network system, and the same work is not carried out before. The end-to-end speaker related end point detection system is realized by adding speaker related information (i-vector) in the training process of the traditional voice end point detection system and applying a Deep Neural Network (DNN) and a long-time and short-time memory neural network (LSTM) to voice end point detection, the voice part of a target speaker can be directly output through a single network, and other silent sections in audio and the voice of a non-target speaker are removed.

FIG. 2 illustrates an LSTM-based speaker dependent VAD provided by an embodiment of the present application that concatenates the acoustic features of each frame in a segment of audio with the i-vector features of the targeted speaker.

In the part of the lowest feature input, the acoustic features are directly input in the traditional method, while the speaker-related information (i-vector) is added, and the acoustic features of each frame are spliced with the characterization i-vector of the target speaker to serve as new input features. After training of the intermediate neural network, the prediction result of whether each frame is the voice of the target speaker can be directly output. The intermediate neural network portion may be replaced with other networks, except that the DNN input is the data for each frame after splicing, while the LSTM input is the data for the entire sentence after splicing.

As shown in FIG. 3, a feature merging method is described herein for use in speaker dependent speech endpoint detection to solve the problem of false transitions between speech and non-speech and the "fragmentation problem". Firstly, n adjacent speech frames are combined in an averaging mode in the characteristic input part, and the obtained new characteristic is one-n times of the original characteristic in length, so that the aim of enhancing the continuity between the speech frames is fulfilled. Then after the model outputs the predicted values, each predicted value is repeated for n times, so that the length is consistent with the initially input characteristic length, and each frame is guaranteed to have corresponding predicted output.

The embodiment of the application provides a new task: speaker Dependent Voice endpoint Detection (SDVAD) is the process of extracting the Voice portion of a target Speaker from the audio alone. This task is common in real life scenarios, and the conventional solution is to use Speaker Verification (SD) to screen out the voice part of the target Speaker after all voice segments are recognized by the voice endpoint detection system. In the embodiment of the present application, we propose a method of end-to-end, neural network based approach to solve this problem, which adds speaker-related information during modeling, and is an online detection system with very low system delay. Based on the Switchboard dataset, we generated a speech dataset for a telephone conversation scenario and performed some experiments on this dataset. Experimental results show that compared with a method of voice endpoint detection and speaker confirmation, the on-line detection system provided by the inventor has better effects on frame-level accuracy and F-score indexes. We also performed a more comprehensive analysis of the different systems using the evaluation index at the paragraph level we previously proposed.

The procedures and experiments performed by the inventors in order to enable those skilled in the art to better understand the scheme of the present application are described below.

Brief introduction to the drawings

Voice endpoint detection (VAD) is one of the most critical techniques in speech signal processing to separate speech from non-speech segments within the audio. VADs are commonly used as pre-processing steps for various speech processing tasks, such as Automatic Speech Recognition (ASR), speech synthesis, speaker recognition, and Voice over Internet protocol (VoIP). The quality of the VAD directly affects the performance of subsequent tasks.

In conventional VAD systems, the non-speech portion is typically composed of silence and noise, and in this work, the non-speech portion also includes the speech portion of the non-target speaker. This is very common in practical applications, for example, a voice assistant may only need to reply to a particular speaker's command, or in a conversational environment, speech from a non-targeted speaker should be considered non-speech. The problem addressed is called speaker dependent voice endpoint detection (SDVAD), which is an extension of the traditional VAD task. In this task, we only want to detect speech from the targeted speaker, so both speech from non-targeted speakers and ambient noise will be ignored. The simple method of this task has two steps: (1) detecting all speech segments using a normal VAD system (2) speaker verification is performed on the obtained speech segments to identify the speech of the target speaker. However, this method is performed in an off-line manner and has a large delay.

Conventional VAD algorithms can be divided into two categories, feature-based methods and model-based methods. With respect to feature-based methods, different acoustic features, such as time domain energy, zero crossing rate, etc., are first extracted and then detected using methods such as threshold comparison. With respect to model-based methods, separate statistical models are trained to represent speech and non-speech segments with different probability distributions, and the determination is made by a posterior probability, which mainly includes a Gaussian Mixture Model (GMM), a Hidden Markov Model (HMM), and the like. Meanwhile, a discriminative model can be directly trained to distinguish between speech and non-speech, such as a Support Vector Machine (SVM) and a deep neural network model.

Recently, deep learning approaches have been successfully applied to many tasks including VAD. For VAD in complex environments, DNN has better modeling capability than traditional methods, a Recurrent Neural Network (RNN) and a long-term memory network (LSTM) can better simulate continuity between inputs, and a Convolutional Neural Network (CNN) can generate better characteristics for neural network training.

To address speaker-dependent voice endpoint detection, we propose a neural network-based system that incorporates speaker-dependent information into a conventional VAD system. Specifically, the feature (i-vector) of the target speaker is added to the acoustic feature. Compared to the two-stage VAD/SV method, the method proposed can be implemented as an end-to-end online system, with relative delay. Experiments were performed on data generated from a multi-person conversation based on the Switchboard dataset and the results show that our proposed online method can achieve better performance and reduce latency compared to the offline VAD/SV method.

The first application in speech endpoint detection based on neural networks is the deep neural network model (DNN). The DNN-based VAD system not only can obtain better effect, but also has low detection complexity. Typical DNN-based VAD systems train a frame-based binary classifier to classify each frame into two classes: speech and non-speech. The input to DNN is, in general, the acoustic features of each frame plus the extension of the preceding and following frames,

Ot＝[xt-r，...，xt-1，xt，xt+1，...，xt+r] (1)

where r is the length of the context extension. DNN is optimized by a loss function of cross entropy. For each frame, classification is performed by comparison between the posterior probabilities of the two classes.

LSTM-based VAD system

LSTM is able to model sequences and capture long-term correlations in a range of features. Its core is made up of special units called blocks. Each block contains an input gate, an output gate and a forgetting gate, enabling the model to memorize short or long time information dependencies. The LSTM structure can efficiently use context to sequentially model the input acoustic features.

The LSTM network computes a mapping from the input sequence x ═ x1, x 2.., xT ] to the output sequence y ═ y1, y 2.., yT ]. For more details of this architecture, reference may be made to the relevant paper.

If applied to VAD, LSTM-based systems output predictions on a frame-by-frame basis, but each prediction for the current frame depends in part on its history. The training criteria are the same as DNN.

Speaker dependent voice endpoint detection related work

For speaker-dependent VADs, some previous studies used microphone arrays to track targeted speakers. Some studies also consider the speaker identity of VAD, which uses a VAD system based on a gaussian mixture model using an additional GMM to model the target speaker. However, it should be noted that we have different experimental environments from the previous work and also solve different problems. In previous studies, speech from other speakers appeared as background noise, while in our task, it was aimed at conversational scenarios where speech from different speakers did not overlap. Another situation is when the smart audio is used at home where the speech recognition system is disturbed by the conversation of other family members. Typically, such speaker-dependent speech endpoint detectors are required for systems that only want to accept speech signals from a particular speaker.

Speaker-based feature i-vector

Speaker modeling plays a crucial role in speech processing tasks, such as speaker recognition, speaker segmentation clustering, speaker adaptation for speech recognition. In recent years, the i-vector system based on factor analysis achieves remarkable performance improvement in a speaker recognition task, and the speaker characterization mode is also suitable for speaker adaptive training of other related tasks such as voice conversion and voice recognition.

Basically, the i-vector is a low dimensional fixed length representation of speech that retains speaker specific information. For the i-vector framework, the speaker and conversation related supervectors M (derived from UBM) are modeled as

M＝m+Tw (2)

Where m is the speaker and session independent supervector, T is a low rank matrix representing speaker and session variability, and i-vector is the posterior mean of w.

Baseline system

As mentioned in the introduction, the intuitive approach would be a two-stage approach to the task of a speaker-dependent VAD. First, a common VAD is used to detect all speech segments without speaker differentiation, and then we use a speaker verification system to pick out the speech segments that belong to the target speaker. Thus, the baseline system is a combination of VAD and a text-independent speaker verification system, which is referred to as the VAD/SV method in the remainder of this document.

In this work, the DNN and LSTM based systems were trained for the VAD phase, while for the speaker verification section, we used the most advanced currently i-vector based probabilistic Linear discriminant analysis (i-vector/PLDA framework).

End-to-end speaker dependent VAD System (SDVAD)

According to the baseline system, the speaker verification stage is after obtaining VAD predictions of the entire audio, which adds to the latency of the system. Moreover, it does not directly optimize the final goal of this task. Therefore, we propose to introduce speaker modeling in the original VAD network so that the model can provide frame-level speaker dependent prediction. Since the model is now trained in an end-to-end fashion, the data information can be leveraged for better results.

The proposed system is depicted in the aforementioned FIG. 2, which will extract the i-vector of the target speaker from the user's registered speech using a pre-trained i-vector extractor. Then we concatenate the acoustic features at the frame level and the i-vector of the target speaker as the new input to the neural network. This is possible for both the training and reasoning phases. For the training phase, the audio data is labeled accordingly, so that speaker-specific data can be used to extract the corresponding i-vector. It is also reasonable to require the user to register their voice first when using the system for the first time during the reasoning phase.

During the training process, only the speech portion of the target speaker is treated as a positive sample, while both the speech portion and the non-speech portion of the non-target speaker are treated as negative samples. Thus, the model can directly output the final speaker dependent prediction for each frame without the need for an additional speaker verification stage. Our proposed speaker-dependent VAD system is an online system with low latency.

Post-processing and feature classification

VAD is different from the common binary classification problem because the characteristics of the audio signal are continuous, which means that adjacent frames are highly correlated. The raw output of the model typically contains many false transitions, resulting in "fragmentation problems" due to impulse noise and other disturbances. This problem is more pronounced for frame-based classifiers like DNN. Therefore, it is important to apply post-processing methods to smooth the original output of the model and reduce frequent false transitions between speech and non-speech. A rule-based post-processing method is commonly used that uses a sliding window to smooth the output of the model, eliminating some erroneous speech-to-non-speech transitions.

Most post-processing methods add additional delay to the online VAD system. In this context, another approach, called feature merging, is used to help solve the "fragmentation problem" in VAD associated with speakers. Except that we try to smooth the input features instead of the model output. With VAD, feature merging is done by grouping values into a fixed number of blocks. As shown in fig. 3, we use mean reduction to merge the input features of adjacent n frames, where the frames do not overlap during the merging process. This process reduces the number of original frames to 1/n. Each output prediction of the model is then repeated n times to correspond to the original features of each frame. The delay caused by this method is negligible.

For the DNN model, normal frame extension is used to add context information and reduce false transitions in the prediction results. For the LSTM model, we use feature merging to preserve speech continuity and reduce computation time.

Experiment of

Data set

We have experimented with dialog datasets generated from the Switchboard corpus. After removing the repeated conversations and the speakers with insufficient data, we still leave 250h audio data for 500 speakers, where each audio contains only one speaker. We then split these filtered data into train, dev and test set. There were 450 speakers in the training set, 10 speakers in the development set, and the remaining 40 speakers in the testing set.

The training data is generated as follows:

(1) and extracting i-vector for the speakers in the training set.

(2) The ith audio is randomly selected from the data of the s-th speaker, called utt _ s _ i, and the jth audio of the t-th speaker, where s is not equal to j, is selected at the same time, and the two pieces of audio are spliced together as a new sentence utt _ new.

(3) The i-vector of the targeted speaker is connected to utt _ new audio for each frame to form the final input to the neural network. The generation of development data and test data is similar, while we assume that the i-vector of the target speaker is obtained through an additional registration phase.

Feature(s)

For the i-vector extractor, 20-dimensional MFCCs with a frame length of 25ms are extracted as front-end features. The UBM consists of 2048 GMMs, and the extracted i-vector has a dimension of 200. The PLDA is used as a score and compensates for channel distortion. The basic characteristics of all neural networks are 36-dimensional log filter banks, the frame length is 25ms, and the frame shift is 10 ms. For the DNN model, the input layer is formed by a contextual window of 11 frames. Both the DNN and LSTM models contain two hidden layers.

Frame level evaluation

The results of the frame-level evaluation are reported in Accuracy (ACC) and F-score (F1, harmonic mean of precision and recall), listed in table 1.

Without any pre-or post-processing, LSTM can be found to have better performance than DNN in VAD/SV baseline and SDVAD systems due to its sequence modeling capability. For the LSTM SDVAD system, the ACC and F-score of the SDVAD system are slightly higher than the VAD/SV baseline system, which means that our proposed speaker-dependent VAD method is effective. .

To address the "fragmentation problem" and further improve system performance, the aforementioned rule-based post-processing and feature merging is applied to these systems. From the results, it can be seen that the post-treatment can slightly improve the performance of DNN and LSTM SDVAD

Table 1: ACC (%) and F score (%) for different systems. VAD/SV represents VAD followed by speaker verification, with a two-part baseline system, while SDVAD represents our proposed end-to-end speaker-dependent VAD system. "+ post" and "+ binning" denote applying post-processing and feature merging, respectively. For post-processing, the size of the sliding window is 10 frames. The size of the feature merge is 4.

On the other hand, the feature merging approach we used can greatly benefit LSTM-based SDVAD systems, increasing ACC from 88.31% to 94.42%, and can be further increased to 94.62% by post-processing. F-score has the same improvement as ACC.

We need to note here that as a first stage of the baseline system, a normal VAD can get good accuracy of speech/non-speech classification (no speaker discrimination) without too much fragmentation. The feature merging has no significant impact on the first stage and therefore does not improve the overall VAD/SV system. For the same reason, the post-processing method cannot improve the VAD/SV system, so the post-processed VAD/SV results are not added to table 1. The reason for the performance difference between the two processing methods is that the post-processing operation does not affect the training process of the SDVAD, while the feature classification as a pre-processing step can be considered as part of the neural network, which helps the network to make full use of the information.

Paragraph level evaluation

ACC and F scores are only an indication of frame level classification capability. We wish to further investigate the performance of VAD/SV baseline and SDVAD systems at the paragraph level. Here we use the evaluation metric J VAD we previously proposed.

The J _ VAD contains four different sub-criteria, namely Start Boundary Accuracy (SBA), End Boundary Accuracy (EBA), boundary accuracy (BP) and frame Accuracy (ACC). ACC is the accuracy of correctly identifying a frame. SBA and EBA are indications of boundary level accuracy. BP is an indicator that measures the integrity of the VAD output segment. The harmonic mean of the above four sub-criteria is defined as the segment level J _ VAD. The analysis was performed from these four aspects. Paragraph level J _ VAD results are shown in Table 2

For a more intuitive comparison, only the LSTM model is used here. In contrast to the VAD/SV baseline system, we can find that the original SDVAD system is limited by the "fragmentation problem". Without any pre-and post-processing methods, the prediction of the SDVAD system may contain some erroneous state transitions and fragments. These fragments result in an increased number of paragraphs being detected by different systems.

Table 2: j _ VAD (%) and 3 sub-standards (%) for different systems are listed, except for ACC, as shown in Table 1

Therefore, BP evaluation was poor. Feature merging can effectively reduce these false transitions. All paragraph assessments were improved and approached the effect of the baseline system.

To better compare different systems, the predicted results of test cases are shown in FIG. 4. Fig. 4 shows predictions for different systems, it can be observed that some segments exist in the prediction results of the SDVAD system, and feature merging can effectively solve the "fragmentation problem".

The VAD/SV system provides some false positives for non-target speakers, which is reasonable because VAD and SV are two separate stages that cannot be optimized for the final target of the task.

Conclusion

In this context, end-to-end neural network based systems are designed to address the VAD problem associated with a speaker, which is intended to detect only speech from a targeted speaker. Compared to the more delayed two-stage VAD/SV method, our proposed end-to-end method (SDVAD) directly brings speaker information into the modeling process and can directly perform online prediction. Results of a series of experiments were obtained based on the frame-level index and the paragraph-level index we previously proposed. For the frame-level evaluation, our proposed LSTM SDVAD system has a significant performance improvement over the conventional VAD/SV system, from 86.62% to 94.42% in terms of frame accuracy. To solve the "fragmentation problem", we introduce feature merging in the LSTM SDVAD system, which significantly improves the evaluation effect at the paragraph level.

Referring to fig. 5, a block diagram of an apparatus for detecting a speaker dependent end-to-end voice endpoint according to an embodiment of the present invention is shown.

As shown in fig. 5, the end-to-end voice endpoint detection apparatus 500 includes an extraction module 510, a concatenation module 520, and an output module 530.

The extracting module 510 is configured to extract an acoustic feature of the speech to be detected; a splicing module 520 configured to splice the acoustic features with i-vector features of the target speaker as new input features; and an output module 530 configured to input the new input features into a neural network for training and output a prediction result of whether the speech to be detected is the speech of the target speaker.

In some optional embodiments, the splicing module 520 is further configured to: extracting i-vector characteristics of a target speaker from the voice to be detected by using a pre-trained i-vector extractor; the frame-level acoustic features and the i-vector features of the target speaker are concatenated as a new input.

It should be understood that the modules recited in fig. 5 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 5, and are not described again here.

It should be noted that the modules in the embodiments of the present application are not intended to limit the solution of the present application, and for example, the extraction module may be described as a module that extracts the acoustic features of the speech to be detected. In addition, the related functional modules may also be implemented by a hardware processor, for example, the extraction module may also be implemented by a processor, which is not described herein again.

In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, where the computer-executable instructions may execute the end-to-end voice endpoint detection method related to a speaker in any of the above method embodiments;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

extracting acoustic features of the voice to be detected;

splicing the acoustic features and i-vector features of the target speaker to serve as new input features;

inputting the new input features into a neural network for training and outputting a prediction result of whether the voice to be detected is the voice of the target speaker.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the speaker dependent end-to-end voice endpoint detection apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, and the remote memory may be connected to the speaker dependent end-to-end voice endpoint detection apparatus over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the above speaker dependent end-to-end speech endpoint detection methods.

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 6, the electronic device includes: one or more processors 610 and a memory 620, with one processor 610 being an example in fig. 6. The apparatus of the end-to-end voice endpoint method may further comprise: an input device 630 and an output device 640. The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or other means, such as the bus connection in fig. 6. The memory 620 is a non-volatile computer-readable storage medium as described above. The processor 610 executes various functional applications of the server and data processing by running non-volatile software programs, instructions and modules stored in the memory 620, so as to implement the speaker-dependent end-to-end voice endpoint detection method of the above method embodiment. The input device 630 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the end-to-end speech endpoint detection device associated with the speaker. The output device 640 may include a display device such as a display screen.

The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.

As an embodiment, the electronic device is applied to an end-to-end voice endpoint detection apparatus related to a speaker, and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:

extracting acoustic features of the voice to be detected;

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) a mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice, data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.

(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as ipads.

(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.

(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.

(5) And other electronic devices with data interaction functions.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for speaker dependent end-to-end voice endpoint detection, comprising:

extracting acoustic features of the voice to be detected;

inputting the new input features into a neural network and obtaining a prediction result of whether the voice to be detected output by the neural network is the voice of the target speaker, wherein in the training process of the neural network, only the voice part of the target speaker is regarded as a positive sample, and the voice part and the non-voice part of the non-target speaker are regarded as negative samples;

and outputting the voice part of the target speaker based on the prediction result so as to remove the silence segment in the voice to be detected and the voice of the non-target speaker.

2. The method according to claim 1, wherein the neural network is a deep neural network, and the inputting the new input features into the neural network and obtaining the prediction result of whether the speech to be detected output by the neural network is the speech of the target speaker comprises:

inputting each frame data in the new input features to a deep neural network respectively;

and respectively outputting the detection result of whether each frame of data is the voice of the target speaker.

3. The method according to claim 1, wherein the neural network is a long-term and short-term memory recurrent neural network, and the inputting the new input feature into the neural network and obtaining the prediction result of whether the speech to be detected output by the neural network is the speech of the target speaker comprises:

inputting the whole sentence data corresponding to the new input features into a long-term memory recurrent neural network;

and outputting the detection result of whether each frame of data is the voice of the target speaker.

4. The method according to claim 1, wherein before the inputting the new input features into a neural network and obtaining the prediction result of whether the speech to be detected output by the neural network is the speech of the target speaker, the method further comprises:

the n speech frames that are adjacent in the new input feature are combined and then averaged.

5. The method according to claim 4, wherein after said outputting the prediction result of whether the speech to be detected is the speech of the target speaker, the method further comprises:

and repeating the corresponding prediction result of each speech frame n times to form a final output.

6. The method of any one of claims 1-5, wherein the stitching the acoustic feature with an i-vector feature as a new input feature comprises:

extracting i-vector characteristics of a target speaker from the voice to be detected by using a pre-trained i-vector extractor;

the frame-level acoustic features and the i-vector features of the target speaker are concatenated as a new input.

7. A speaker dependent end-to-end speech endpoint detection apparatus, comprising:

the extraction module is configured to extract acoustic features of the voice to be detected;

the splicing module is configured to splice the acoustic features and i-vector features of the target speaker to serve as new input features;

the output module is configured to input the new input features into a neural network and acquire a prediction result of whether the speech to be detected output by the neural network is the speech of the target speaker, wherein in the training process of the neural network, only the speech part of the target speaker is regarded as a positive sample, and the speech part and the non-speech part of the non-target speaker are regarded as negative samples; and outputting the voice part of the target speaker based on the prediction result so as to remove the silence section in the voice to be detected and the voice of the non-target speaker.

8. The apparatus of claim 7, wherein the stitching module is further configured to:

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 6.

10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, is adapted to carry out the steps of the method of any one of claims 1 to 6.