CN110428854B - Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment - Google Patents


Info

Publication number: CN110428854B
Application number: CN201910740881.3A
Authority: CN (China)
Prior art keywords: voice, data, vehicle, endpoint, training
Legal status: Active (the legal status is an assumption and not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Other languages: Chinese (zh)
Other versions: CN110428854A (application publication)
Inventor: 杨柳 (Yang Liu)
Current and original assignee: Tencent Technology Shenzhen Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to application CN201910740881.3A
Publication of application CN110428854A, followed by grant and publication of CN110428854B

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L25/87 — Detection of discrete points within a voice signal

Abstract

The application relates to a voice endpoint detection method and device for a vehicle-mounted terminal, a computer-readable storage medium, and computer equipment. The method comprises the following steps: acquiring the current vehicle-mounted network condition of the vehicle-mounted terminal and a voice data stream to be detected; when it is determined that the current vehicle-mounted network condition does not meet a preset online detection condition, extracting a voice feature vector from the voice data stream; performing voice endpoint analysis on the voice feature vector through a voice endpoint detection model at the vehicle-mounted terminal to obtain a voice endpoint analysis result, wherein the voice endpoint detection model is trained on voice training data collected in different environments; and determining a voice endpoint in the voice data stream according to the voice endpoint analysis result to obtain a voice endpoint detection result. The scheme provided by the application can improve the accuracy of voice endpoint detection at the vehicle-mounted terminal.

Description

Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for detecting a voice endpoint at a vehicle-mounted terminal, a computer-readable storage medium, and a computer device.
Background
With the development of computer technology, speech recognition, also called Automatic Speech Recognition (ASR), has attracted increasing attention. Speech recognition enables a machine to convert a speech signal into corresponding text or commands through recognition and understanding, and draws on signal processing, pattern recognition, probability theory and information theory, vocalization and auditory mechanisms, artificial intelligence, and other fields. In speech recognition, the speech signal must first be detected within continuous speech data by Voice Activity Detection (VAD) so that subsequent recognition can be performed on it. Traditional voice endpoint detection is implemented with energy-based algorithms.
However, conventional energy-based voice endpoint detection struggles to handle voice signals mixed with noise from diverse sources. In a complex noise environment such as a vehicle cabin, the front endpoint tends to clip the opening words and the rear endpoint fails to terminate, so the accuracy of voice endpoint detection is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer-readable storage medium, and a computer device for detecting a voice endpoint of a vehicle-mounted terminal, which can improve accuracy of detecting the voice endpoint of the vehicle-mounted terminal.
A voice endpoint detection method of a vehicle-mounted end comprises the following steps:
acquiring the current vehicle-mounted network condition of a vehicle-mounted end and a voice data stream to be detected;
when the current vehicle-mounted network condition is determined not to meet the preset online detection condition, extracting a voice feature vector from the voice data stream;
performing voice endpoint analysis on the voice feature vector through a voice endpoint detection model of the vehicle-mounted end to obtain a voice endpoint analysis result, wherein the voice endpoint detection model is obtained by training based on voice training data obtained in different environments;
and determining the voice endpoint in the voice data stream according to the voice endpoint analysis result to obtain a voice endpoint detection result.
A voice endpoint detection apparatus of a vehicle-mounted terminal, the apparatus comprising:
the voice data acquisition module is used for acquiring the current vehicle-mounted network condition of the vehicle-mounted terminal and the voice data stream to be detected;
the voice feature extraction module is used for extracting a voice feature vector from the voice data stream when the current vehicle-mounted network condition is determined not to meet the preset online detection condition;
the endpoint analysis module is used for carrying out voice endpoint analysis on the voice feature vector through a voice endpoint detection model of the vehicle-mounted end to obtain a voice endpoint analysis result, and the voice endpoint detection model is obtained by training based on voice training data obtained in different environments;
and the endpoint determination module is used for determining the voice endpoint in the voice data stream according to the voice endpoint analysis result to obtain a voice endpoint detection result.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method as described above.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method as described above.
According to the voice endpoint detection method and device of the vehicle-mounted terminal, the computer-readable storage medium, and the computer equipment, when the current vehicle-mounted network condition does not meet the preset online detection condition and online detection cannot be performed, voice endpoint detection is performed on the voice data stream to be detected by a voice endpoint detection model at the vehicle-mounted terminal that is trained on voice training data collected in different environments. Voice data from various noise environments can thus be handled uniformly, and the accuracy of voice endpoint detection at the vehicle-mounted terminal is improved.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a voice endpoint detection method at a vehicle end;
FIG. 2 is a schematic flow chart illustrating a voice endpoint detection method at a vehicle-mounted end according to an embodiment;
FIG. 3 is a flow diagram illustrating the process of extracting speech feature vectors according to one embodiment;
FIG. 4 is a schematic diagram comparing head and tail endpoint detection on test data according to one embodiment;
FIG. 5 is a block diagram illustrating an exemplary embodiment of a voice endpoint detection apparatus at a vehicle-mounted terminal;
FIG. 6 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is an application environment diagram of a voice endpoint detection method of a vehicle-mounted terminal in one embodiment. Referring to fig. 1, the voice endpoint detection method of the vehicle-mounted terminal is applied to a voice recognition system of the vehicle-mounted terminal. The voice recognition system includes a terminal 110 provided in a vehicle, and the terminal 110 automatically recognizes a voice in the vehicle. Specifically, when the current vehicle-mounted network status does not satisfy the preset online detection condition and online detection cannot be performed, the terminal 110 performs voice endpoint detection on the voice data stream to be detected through a voice endpoint detection model obtained by training the vehicle-mounted terminal based on the voice training data obtained in different environments. In addition, the terminal 110 may also be connected to a server (not shown) through a network, and when the current vehicle-mounted network condition satisfies an online detection condition, the server performs voice endpoint detection and receives a voice endpoint detection result returned by the server. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
As shown in fig. 2, in one embodiment, a voice endpoint detection method at a vehicle side is provided. The embodiment is mainly illustrated by applying the method to the terminal 110 in fig. 1. Referring to fig. 2, the method for detecting the voice endpoint of the vehicle-mounted terminal specifically includes the following steps:
s202: and acquiring the current vehicle-mounted network condition of the vehicle-mounted terminal and the voice data stream to be detected.
The key technologies of speech technology are Automatic Speech Recognition (ASR), speech synthesis (TTS, Text To Speech), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising interaction modes. Applying automatic speech recognition at the vehicle-mounted terminal to help a driver control the vehicle by voice is an important application direction of the technology. Voice endpoint detection, i.e., detecting a voice signal within continuous voice data, is an important precondition for speech recognition to meet real-time application requirements. Voice endpoints include a head endpoint (also called a front or start endpoint) and a tail endpoint (also called a back or stop endpoint) of the voice signal: the voice signal starts at the head endpoint and ends at the tail endpoint; specifically, the head endpoint may be the moment the voice signal begins, and the tail endpoint the moment it ends. Voice endpoint detection determines the head and tail endpoints of a voice signal within a piece of voice data, thereby establishing the presence of a voice signal or extracting it from continuous voice data. Through VAD processing, the effective voice data, i.e., the segment between the head and tail endpoints, can be accurately identified; audio coding can then be optimized and network data transmission reduced, improving the efficiency of speech recognition and meeting the strict real-time requirements of the vehicle-mounted terminal.
Currently, a commonly used VAD, based on an energy algorithm, sets a series of assumptions and parameter thresholds to control the detection effect: for example, it calculates the energy and spectral entropy of each voice frame and the energy mean and spectral-entropy mean over multiple frames, classifies each frame as speech or silence by comparing the per-frame values with the means, and then determines the voice endpoints. However, in a complex vehicle-mounted noise environment, energy-based VAD tends to clip words at the front endpoint and fail to stop at the rear endpoint, so its accuracy is low. Even with workarounds on Linux, QNX, Android, and other operating systems, such as caching the N seconds of data before VAD detection, disabling VAD, setting an ASR timeout, or relying on cloud ASR detection results, the Signal-to-Noise Ratio (SNR) remains low under complex noise from the vehicle engine, tire noise, external horns, and the like, and noise from diverse sources is difficult to accommodate through parameter tuning alone, so the overall accuracy of VAD is still limited. This embodiment therefore provides a vehicle-mounted voice endpoint detection method that can improve detection accuracy.
In this embodiment, the current in-vehicle network condition may be a network state in the vehicle. For example, a terminal installed in a vehicle is connected to a server via a network, and the current vehicle-mounted network condition may reflect the quality of the network connection. The quality of the network connection can be detected in real time to obtain a detection result including the current vehicle-mounted network condition. The voice data stream to be detected is voice data in a vehicle which needs to be detected, and particularly, voice data of people in the vehicle can be collected in real time through voice collecting equipment such as a microphone to obtain the voice data stream to be detected.
S204: and when the current vehicle-mounted network condition is determined not to meet the preset online detection condition, extracting the voice characteristic vector from the voice data stream.
The online detection condition is set according to actual requirements and is used to decide whether detection should be handled by the remote server. For example, the condition may be satisfied when the current vehicle-mounted network speed exceeds a predetermined threshold. The speech feature vector represents the speech characteristics of the voice data stream, and a frame can be judged as silence or speech according to it. In a specific implementation, different speech feature vectors, such as FBank feature vectors, may be produced by different feature extraction algorithms.
In this embodiment, after the current vehicle-mounted network condition of the vehicle-mounted terminal is obtained, the preset online detection condition is also obtained and the two are compared. If the current vehicle-mounted network condition does not satisfy the online detection condition, detection cannot be performed by the remote server; the voice data stream is therefore processed directly by the local terminal in the vehicle, and a voice feature vector is extracted from it. In a specific implementation, the voice data stream may first be preprocessed (pre-emphasis, framing, applying a Hamming window, and converting the time-domain signal to the frequency domain, for example by Fast Fourier Transform (FFT)), after which the speech feature vector, such as an FBank feature vector, is obtained through a Mel filter bank.
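The preprocessing chain just described (pre-emphasis, framing, Hamming window, FFT, Mel filter bank) can be sketched in numpy. This is a minimal illustration, not the patent's implementation: the 16 kHz sample rate, 512-point FFT, pre-emphasis coefficient 0.97, and log compression are assumptions; only the 25 ms / 10 ms frame parameters and the 40 mel filters appear in the text.

```python
import numpy as np

def fbank_features(signal, sample_rate=16000, frame_len_ms=25, frame_shift_ms=10,
                   n_fft=512, n_mels=40, pre_emph=0.97):
    # Pre-emphasis: boost high frequencies before spectral analysis
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing: 25 ms frames with a 10 ms shift (values from the text)
    frame_len = int(sample_rate * frame_len_ms / 1000)
    frame_shift = int(sample_rate * frame_shift_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    idx = (np.arange(frame_len)[None, :] +
           np.arange(n_frames)[:, None] * frame_shift)
    frames = emphasized[idx] * np.hamming(frame_len)   # Hamming window
    # FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular Mel filter bank (40 filters, as in the text)
    def hz_to_mel(hz): return 2595 * np.log10(1 + hz / 700)
    def mel_to_hz(mel): return 700 * (10 ** (mel / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log-energy in each mel band -> per-frame 40-dim FBank vector
    return np.log(power @ fbank.T + 1e-10)
```

One second of 16 kHz audio yields 98 frames of 40-dimensional FBank features under these parameters.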
S206: and performing voice endpoint analysis on the voice characteristic vector through a voice endpoint detection model of the vehicle-mounted end to obtain a voice endpoint analysis result, wherein the voice endpoint detection model is obtained by training based on voice training data obtained in different environments.
The voice endpoint detection model is trained on voice training data collected in different environments and performs voice endpoint analysis on the speech feature vectors to obtain a voice endpoint analysis result. Specifically, the model may be a DNN. With its huge number of parameters and nonlinear characteristics, a DNN can form high-level abstractions of speech data using multiple processing layers composed of nonlinear transformations; extended to a combination of many neurons, it has strong expressive power over the input features, and training yields weight (W) and bias parameters that fit them. Various noises can be learned from a large amount of speech feature data, improving the robustness of VAD to vehicle-mounted scene noise; voice data from various noise environments can thus be handled uniformly, and the accuracy of voice endpoint detection at the vehicle-mounted terminal is improved.
S208: and determining a voice endpoint in the voice data stream according to the voice endpoint analysis result to obtain a voice endpoint detection result.
And after a voice endpoint analysis result output by the voice endpoint detection model is obtained, determining a voice endpoint in the voice data stream according to the voice endpoint analysis result to obtain a voice endpoint detection result. The voice endpoints may include a beginning point (i.e., a start time of the voice signal) and an ending point (i.e., an end time of the voice signal) of the voice signal, and the obtained voice endpoints may be directly used as a voice endpoint detection result. In addition, the voice endpoint detection result may also include the length of the voice signal to be detected and processed according to actual requirements, and specifically may be obtained according to the time span between the detected leading endpoint and the detected trailing endpoint.
In a specific implementation, the voice endpoint analysis result includes, for each voice data frame in the stream, a prediction of whether it is a silence frame or a speech frame; the final head and tail endpoint times are then obtained through a smoothing procedure to determine the voice endpoints. Specifically, a threshold T may be set, for example T = 17, with the currently detected frame numbered N and a frame length of 10 ms. Starting from the beginning of the input, only when T consecutive speech frames have been detected is the head endpoint placed at time (N - T) × 10 ms; starting from the head endpoint, only when T consecutive non-speech frames have been detected is the tail endpoint placed at time (N - T) × 10 ms. The voice endpoint detection result is thereby obtained.
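The smoothing rule above (T consecutive frames, 10 ms frame length) can be sketched as a short loop over per-frame labels; the function name and label encoding (1 = speech, 0 = silence) are illustrative assumptions:

```python
def detect_endpoints(frame_labels, T=17, frame_ms=10):
    """Find head/tail endpoint times (ms) from per-frame speech(1)/silence(0)
    labels: the head endpoint is placed at (N - T) * frame_ms once T consecutive
    speech frames are seen, the tail endpoint at (N - T) * frame_ms once T
    consecutive silence frames are seen after the head."""
    head = tail = None
    run = 0
    for n, label in enumerate(frame_labels, start=1):  # N counts from 1
        if head is None:
            run = run + 1 if label == 1 else 0
            if run == T:
                head = (n - T) * frame_ms
                run = 0
        elif tail is None:
            run = run + 1 if label == 0 else 0
            if run == T:
                tail = (n - T) * frame_ms
                break
    return head, tail
```

For example, with 10 leading silence frames, 30 speech frames, and 30 trailing silence frames, the head endpoint lands at 100 ms and the tail endpoint at 400 ms.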
According to the voice endpoint detection method of the vehicle-mounted end, when the current vehicle-mounted network condition does not meet the preset online detection condition and online detection cannot be carried out, voice endpoint detection is carried out on a voice data stream to be detected through a voice endpoint detection model obtained by training the vehicle-mounted end based on voice training data obtained in different environments, voice data of various noise environments can be processed in a compatible mode, and accuracy of voice endpoint detection of the vehicle-mounted end is improved.
In one embodiment, as shown in fig. 3, the step of extracting the voice feature vector, that is, when it is determined that the current vehicle-mounted network condition does not satisfy the preset online detection condition, extracting the voice feature vector from the voice data stream includes:
s302: and comparing the current vehicle-mounted network condition with a preset online detection condition.
In this embodiment, when the current vehicle-mounted network condition does not satisfy the online detection condition and the vehicle-mounted terminal must perform voice endpoint detection locally, the voice data stream to be detected is framed and the speech feature vector of each voice frame is extracted. Specifically, after the current vehicle-mounted network condition is obtained, the preset online detection condition is queried; for example, the condition may require the network speed to exceed a preset threshold. The current vehicle network speed determined from the current vehicle-mounted network condition is compared with that threshold to judge whether the online detection condition is met, thereby deciding whether the processing body for voice endpoint detection is the local vehicle-mounted terminal or the cloud server.
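The routing decision reduces to a single comparison; a minimal sketch follows, in which the function name and the 128 kbps threshold are purely illustrative (the patent does not specify a threshold value):

```python
def choose_detector(current_speed_kbps, threshold_kbps=128.0):
    """Compare the measured vehicle network speed against the preset online
    detection threshold: above it, VAD runs on the cloud server; otherwise
    the on-device model handles detection."""
    return "server" if current_speed_kbps > threshold_kbps else "local"
```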
S304: and when the current vehicle-mounted network condition does not meet the online detection condition, framing the voice data stream to obtain each voice frame.
When the current vehicle-mounted network condition does not meet the online detection condition, for example when the current vehicle network speed does not exceed the preset threshold, it is judged that voice endpoint detection must be performed directly by the local terminal, and the voice data stream is framed to obtain the individual voice frames. The frame parameters of each speech frame are determined by those set during framing: for example, the frame length K may be 25 ms and the frame shift Q may be 10 ms. In a specific implementation, the voice data stream may be pre-emphasized before framing.
S306: and extracting the characteristics of each voice frame and the upper and lower related frames of the voice frame, and combining to obtain the voice characteristic vector corresponding to each voice frame.
After the voice data stream is framed, features are extracted from each voice frame and its context frames; for example, the context may be the 5 frames before and the 5 frames after the current frame. The features of each voice frame and its context frames are combined to obtain the speech feature vector for that frame. In a specific implementation, a 40-dimensional FBank feature vector is extracted for each speech frame through a Mel filter bank, the 40-dimensional FBank vectors of the 5 preceding and 5 following frames are extracted as well, and all are concatenated into a 440-dimensional speech feature vector. If the currently detected frame is numbered 0, the constructed vector covers the frame offsets [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5], and the 440-dimensional vector can be written as [x0, x1, x2, ..., xi, ..., x438, x439], where xi is the i-th dimension of the feature vector. The speech feature vector reflects the speech characteristics of the data stream and supports the silence/speech judgment that underlies endpoint detection. Combining the features of each frame's context frames preserves the contextual association between speech frames and improves detection accuracy.
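The context stacking step (40-dim FBank × 11 frames → 440 dims) can be sketched as follows. How the patent handles the first and last 5 frames is not stated; edge-padding by repeating the boundary frame is an assumption here:

```python
import numpy as np

def stack_context(feats, context=5):
    """Concatenate each 40-dim frame with its `context` previous and following
    frames into a (2*context+1)*40 = 440-dim vector. Edge frames are padded by
    repeating the first/last frame (padding strategy is an assumption)."""
    n, _ = feats.shape
    padded = np.vstack([np.repeat(feats[:1], context, axis=0),
                        feats,
                        np.repeat(feats[-1:], context, axis=0)])
    # Row r of the result is the concatenation of padded[r], ..., padded[r+2*context],
    # i.e. original frames r-context .. r+context.
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])
```

For a 10-frame, 40-dim feature matrix this yields a 10 × 440 matrix; the first 40 dims of row 5 are frame 0 (offset -5) and dims 200-239 are frame 5 itself (offset 0).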
In one embodiment, after acquiring the current vehicle-mounted network condition of the vehicle-mounted terminal and the voice data stream to be detected, the method further comprises the following steps: when the current vehicle-mounted network condition is determined to meet the preset online detection condition, uploading the voice data stream to a server, so that the server performs endpoint detection on the voice data stream through a voice endpoint detection model of a server end; and receiving a voice endpoint detection result corresponding to the voice data stream sent by the server.
In this embodiment, if the current vehicle-mounted network condition is good, the remote server performs the voice endpoint detection and the terminal simply receives the result. Specifically, when the current vehicle-mounted network condition meets the preset online detection condition, online detection is possible and the voice data stream is uploaded to the server, which performs endpoint detection through a server-side voice endpoint detection model. The server-side model may likewise be trained on voice training data collected in different environments; compared with the model at the vehicle-mounted terminal, it can be dynamically updated in real time from detection results, further improving accuracy. After the voice data stream is uploaded, the server performs voice endpoint detection on it and sends down the result, and the vehicle-mounted terminal receives the voice endpoint detection result corresponding to the stream, achieving accurate endpoint detection at the vehicle-mounted terminal.
In one embodiment, the voice training data includes noise-free voice data, in-vehicle environment voice data, indoor noise voice data, and abnormal voice data; the voice endpoint detection model is obtained by the following steps: determining a training data set according to the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data according to a first proportion, and extracting data from the training data set according to a second proportion to obtain a verification data set; and training the deep learning neural network through the training data set and the verification data set to obtain a trained voice endpoint detection model.
In this embodiment, the voice endpoint detection model is trained on a training set constructed, in fixed proportions, from voice training data collected in different environments. Specifically, the voice training data includes noise-free voice data, vehicle-mounted environment voice data, indoor noise voice data, and abnormal voice data. The noise-free voice data is collected in a quiet environment, for example from the open-source Chinese speech corpus THCHS-30, and has a high overall Signal-to-Noise Ratio (SNR). The vehicle-mounted environment voice data is collected inside a vehicle and has a low overall SNR; it strengthens the model's fit and robustness to various vehicle-mounted noises. The indoor noise voice data is collected in a noisy indoor environment, such as voice data gathered daily by research-and-development and QA (quality assurance) staff. The abnormal voice data is meaningless abnormal audio such as screams.
And determining a training data set according to the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data according to a first proportion. Specifically, the proportion of the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data in the training data set can be determined according to a first proportion, for example, the proportion can be 3:1:5:1, and thus the training data set is constructed. The first proportion can be flexibly set according to actual requirements. After the training data set is obtained, data are extracted from the training data set according to a second proportion to obtain a verification data set, the second proportion can be flexibly set according to actual requirements, for example, the second proportion can be set to be 5% or 10%, namely 5% or 10% of data are extracted from the training data set to serve as the verification data set.
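The dataset construction above, mixing the four sources at the 3:1:5:1 example ratio and then holding out a fraction as the validation set, can be sketched as follows; the function name, pool arguments, and deterministic shuffle seed are illustrative assumptions:

```python
import random

def build_datasets(clean, vehicle, indoor, abnormal, total,
                   mix=(3, 1, 5, 1), val_frac=0.05, seed=0):
    """Draw samples from the four pools at the given ratio (first proportion),
    then split off val_frac of the mixed set as the validation set
    (second proportion)."""
    share = sum(mix)
    train = []
    for pool, w in zip([clean, vehicle, indoor, abnormal], mix):
        train.extend(pool[:total * w // share])  # take this source's quota
    random.Random(seed).shuffle(train)
    n_val = int(len(train) * val_frac)
    return train[n_val:], train[:n_val]
```

With total = 100 and the 3:1:5:1 ratio, the quotas are 30/10/50/10 samples, and a 5% split yields 95 training and 5 validation samples.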
After the training and validation data sets are obtained, the deep learning neural network is trained on them: the feature vector of each training sample is input, and a binary classifier outputs a pair of probabilities, speech probability versus non-speech probability, such as [0.9120, 0.0111] or [0.1234, 0.8152], which constitutes the model's judgment of whether the sample is a silence frame or a speech frame. The network may consist of an input layer [440, 512], hidden layer 1 [512, 128], hidden layer 2 [128, 64], and an output layer [64, 2]: the input layer takes 440 dimensions and outputs 512; hidden layer 1 takes 512 and outputs 128; hidden layer 2 takes 128 and outputs 64; the output layer takes 64 and outputs 2.
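The 440 → 512 → 128 → 64 → 2 topology can be sketched as a plain numpy forward pass. The patent specifies only the layer sizes; the ReLU hidden activations, softmax output, and He-style initialization below are assumptions:

```python
import numpy as np

def init_dnn(seed=0):
    """Random weights for the layer sizes given in the text:
    440 -> 512 -> 128 -> 64 -> 2."""
    rng = np.random.default_rng(seed)
    sizes = [440, 512, 128, 64, 2]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU hidden layers (assumed), softmax output giving a per-frame
    [speech probability, non-speech probability] pair."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)
```

Each input row is one 440-dimensional stacked feature vector; each output row is a probability pair summing to 1.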
When the training end condition is met, for example when the DNN's loss reaches a preset criterion, training ends and the trained voice endpoint detection model is obtained. In a specific implementation, the loss function Loss is the cross-entropy, which may be expressed by the following formula:
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} y'_{ij}\,\log y_{ij}$$

where N is the number of training frames and j indexes the two classes (speech / non-speech).
where y represents the probability distribution predicted by the DNN, for example the output prediction vectors y = [[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]], i.e. the vectors with which the model predicts speech frame / silent frame respectively, and y' represents the true probability distribution given by the machine labels, for example speech frame / silent frame / speech frame [1, 0, 1], yielding the OneHot vectors y' = [[1, 0], [0, 1], [1, 0]]. Specifically, the cross entropy may be implemented on the TensorFlow platform as cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])). During training, the AdamOptimizer optimization algorithm is adopted to reduce the Loss value; AdamOptimizer dynamically adjusts the learning rate of each parameter, so the learning speed is fast in the early stage and slow in the later stage. Specifically, on the TensorFlow platform this can be realized by train_step = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(cross_entropy). The training ends when the Loss function satisfies the preset training ending condition, and a trained voice endpoint detection model is obtained. The voice endpoint detection model can perform voice endpoint analysis on the voice feature vectors to obtain a voice endpoint analysis result, and the voice endpoints in the voice data stream can further be determined based on the voice endpoint analysis result to obtain a voice endpoint detection result.
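The cross-entropy over the example vectors above can be checked with a NumPy re-implementation of the same reduce-sum/reduce-mean expression (a sketch for verification only, not the training code itself):

```python
import numpy as np

# Example predictions y and one-hot labels y_ from the description:
# speech frame / silent frame / speech frame.
y  = np.array([[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]])   # DNN output
y_ = np.array([[1, 0],     [0, 1],     [1, 0]])        # machine labels

# NumPy equivalent of
# tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
cross_entropy = np.mean(-np.sum(y_ * np.log(y), axis=1))
# = -(log 0.8 + log 0.6 + log 0.1) / 3  ≈ 1.0122
```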
In one embodiment, determining the training data set based on the noise-free speech data, the in-vehicle environment speech data, the indoor noise speech data, and the abnormal speech data according to the first ratio comprises: determining training data from the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data according to the total training data and the first proportion; and marking each training data frame in the training data, and constructing a training data set based on each marked training data frame.
In this embodiment, the training data are determined from the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data according to the total amount of training data and the first proportion. The total amount of training data is the size of the training data set, i.e. the amount of training data required. According to the first proportion and the total amount of training data, the data amounts corresponding to the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data can be determined, and the corresponding training data extracted accordingly. After the training data are obtained, each training data frame in the training data is labeled, and a training data set is constructed based on the labeled training data frames. When labeling, each training data frame may be aligned with a PDF (Probability Density Function) based on an ASR acoustic model, such as an HMM-GMM (Hidden Markov Model-Gaussian Mixture Model)/TDNN (Time-Delay Neural Network)/LSTM (Long Short-Term Memory network) model under the Kaldi framework; each training data frame is then mapped by the PDF to a phoneme in the phoneme table, and is labeled as a silence frame or a speech frame according to its corresponding phoneme. In a specific implementation, a silence frame may be labeled as 0 and a speech frame as 1, so as to obtain the OneHot vector of each speech frame; for example, the label vector of a speech frame sequence is: [0,0,0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,0,0]. The 1/0 speech/non-speech labels of the training data frames can be used as the target labels in supervised learning, and the training data set constructed from the labeled training data frames can be used as the training data in supervised learning.
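The conversion from the per-frame 0/1 labels to OneHot vectors can be sketched as follows, using the orientation from the description (speech frame → [1, 0], silence frame → [0, 1]); the helper name is illustrative:

```python
def to_one_hot(frame_labels):
    """Map each frame label (1 = speech, 0 = silence) to its OneHot pair,
    matching the y' convention used for the cross-entropy loss."""
    return [[1, 0] if lab == 1 else [0, 1] for lab in frame_labels]

# A short labeled frame sequence: silence, speech, speech, silence.
one_hot = to_one_hot([0, 1, 1, 0])
```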
In one embodiment, after obtaining the trained speech endpoint detection model, the method further includes: acquiring a test data set carrying manually labeled voice endpoints; performing voice endpoint analysis on the test data in the test data set through the voice endpoint detection model to obtain a voice endpoint analysis result, and determining a test voice endpoint in the test data according to the voice endpoint analysis result; and comparing the manually labeled voice endpoint of the test data with the test voice endpoint according to the detection-accurate judgment condition, the detection-loss judgment condition and the detection-error judgment condition to obtain a model detection result.
In this embodiment, after the training of the voice endpoint detection model is finished, the model may be tested through a preset test data set to determine its robustness. Specifically, after the trained voice endpoint detection model is obtained, a test data set carrying manually labeled voice endpoints is acquired; the test data set includes test data carrying manually labeled voice endpoints. The test data may include 300 pieces of indoor voice data recorded by the QA team, which have a high signal-to-noise ratio; the test data may also include 1000 pieces of vehicle-mounted voice data recorded on the road, whose signal-to-noise ratio (SNR) is low, so as to test the robustness of the model to noise. The test data are manually marked to produce the start-point and end-point times of all audio files in the test data set, for example [[0.15, 3.15], [0.2, 2.8]].
After the test data set is obtained, the test data are processed through the trained voice endpoint detection model, and the test voice endpoint corresponding to each piece of test data is detected by the model. Specifically, the voice endpoint detection model performs voice endpoint analysis on the test data in the test data set to obtain a voice endpoint analysis result, and the test voice endpoint in the test data is determined according to the voice endpoint analysis result. The manually marked voice endpoint of the test data is then compared with the test voice endpoint according to the detection-accurate judgment condition, the detection-loss judgment condition and the detection-error judgment condition to obtain a model detection result. The detection-accurate judgment condition is used to judge the accuracy of the test voice endpoint compared with the manually marked voice endpoint; the detection-loss judgment condition is used to judge the degree to which invalid non-voice data is introduced by the test voice endpoint; and the detection-error judgment condition is used to judge whether the test voice endpoint truncates the complete test data and thereby loses valid voice data.
Specifically, as shown in FIG. 4, Start Label is the manually labeled start endpoint in a piece of test data, and End Label is the manually labeled end endpoint. For the start endpoint, the manually marked Start Label is taken as the reference point: the detection-accurate judgment condition Match means the accuracy meets the requirement, i.e. the predicted point falls within the range from 0.5s before to 0.05s after the Start Label; the detection-loss judgment condition Loss means the accuracy does not meet the requirement, but no valid audio data is lost and ASR can still recognize normally, only an invalid non-voice signal being passed through; the detection-error judgment condition Error means the accuracy does not meet the requirement and, at the same time, valid audio data is truncated, so ASR may fail to recognize the whole sentence normally. For the end endpoint, the manually marked End Label is taken as the reference point, and Match, Loss and Error are computed by analogy with the start endpoint. In particular, Match for the end endpoint in this application is the range from 0.25s before to 0.5s after the End Label. It can be understood that the requirement of VAD on the voice endpoint detection model is to avoid Error, and to reduce Loss as much as possible so as to maximize the proportion of Match. By comparing the manually marked voice endpoints of the test data with the test voice endpoints under the detection-accurate, detection-loss and detection-error judgment conditions, the proportions of Match, Loss and Error achieved by the trained voice endpoint detection model on the test data can be determined, and the robustness of the voice endpoint detection model can thereby be evaluated.
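The Match/Loss/Error decision described above can be sketched as a small classifier (the function name and the interpretation that an out-of-window point either admits extra non-speech or truncates speech are drawn directly from the description; timestamps are in seconds):

```python
def judge_endpoint(pred, label, kind):
    """Classify a predicted endpoint against its manual label.
    kind='start': Match window is 0.5s before to 0.05s after the label;
    kind='end'  : Match window is 0.25s before to 0.5s after the label.
    Outside the window, a prediction that only admits extra non-speech
    is 'Loss'; one that cuts into valid speech is 'Error'."""
    if kind == 'start':
        lo, hi = label - 0.5, label + 0.05
        cuts_speech = pred > hi      # starting too late truncates speech
    else:
        lo, hi = label - 0.25, label + 0.5
        cuts_speech = pred < lo      # ending too early truncates speech
    if lo <= pred <= hi:
        return 'Match'
    return 'Error' if cuts_speech else 'Loss'
```

For a start point labeled at 0.15s, a prediction at 0.0s is a Match, while 0.5s truncates speech and counts as Error; for an end point labeled at 3.15s, ending at 2.0s cuts the utterance (Error), while ending at 4.0s only adds trailing non-speech (Loss).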
In one embodiment, after acquiring the current vehicle-mounted network condition of the vehicle-mounted terminal, the method further includes: and when the current vehicle-mounted network condition is determined to meet the preset model updating condition, updating the voice endpoint detection model of the vehicle-mounted end.
In this embodiment, when the current vehicle-mounted network condition of the vehicle-mounted terminal meets the model updating condition, the voice endpoint detection model of the vehicle-mounted terminal is updated, so that the version of the model is upgraded in time and its working effect is guaranteed. Specifically, after the current vehicle-mounted network condition of the vehicle-mounted terminal is obtained, a preset model updating condition is acquired and the current vehicle-mounted network condition is compared with it. When the current vehicle-mounted network condition meets the model updating condition, for example when the current network speed meets the model-updating network speed threshold in the model updating condition, the voice endpoint detection model of the vehicle-mounted terminal is updated. In a specific implementation, the model file of the TensorFlow-based voice endpoint detection model may be vad_dnn_model. When the model needs to be updated, a model updating request can be sent to the server to request model updating data, and the voice endpoint detection model of the vehicle-mounted terminal is updated according to the model updating data; alternatively, a model updating request can be sent to the server to directly download the latest voice endpoint detection model, thereby realizing online updating of the voice endpoint detection model at the vehicle-mounted terminal.
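A minimal sketch of the update gate described above (the function name and the 10 Mbps threshold are placeholders — the patent leaves the concrete network speed threshold to the implementation):

```python
def should_update_model(current_speed_mbps, update_threshold_mbps=10.0):
    """Update the on-device VAD model only when the current vehicle network
    speed meets the model-updating network speed threshold."""
    return current_speed_mbps >= update_threshold_mbps

# A fast connection permits the update; a slow one defers it.
fast_ok = should_update_model(50.0)
slow_ok = should_update_model(1.0)
```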
Fig. 2 is a schematic flow chart of a voice endpoint detection method of a vehicle-mounted terminal in one embodiment. It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, these steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
As shown in fig. 5, in one embodiment, there is provided a voice endpoint detection apparatus at a vehicle-mounted end, including:
a voice data obtaining module 502, configured to obtain a current vehicle-mounted network status of a vehicle-mounted terminal and a voice data stream to be detected;
a voice feature extraction module 504, configured to extract a voice feature vector from a voice data stream when it is determined that the current vehicle-mounted network condition does not meet a preset online detection condition;
an endpoint analysis module 506, configured to perform voice endpoint analysis on the voice feature vector through a voice endpoint detection model of the vehicle-mounted terminal to obtain a voice endpoint analysis result, where the voice endpoint detection model is obtained by training based on voice training data obtained in different environments;
and the endpoint determining module 508 is configured to determine a voice endpoint in the voice data stream according to the voice endpoint analysis result, so as to obtain a voice endpoint detection result.
In one embodiment, the speech feature extraction module 504 includes a network comparison module, a speech framing module, and a frame feature extraction module; wherein: the network comparison module is used for comparing the current vehicle-mounted network condition with a preset online detection condition; the voice framing module is used for framing the voice data stream to obtain each voice frame when the current vehicle-mounted network condition does not meet the online detection condition; and the frame characteristic extraction module is used for extracting the characteristics of each voice frame and the upper and lower related frames of the voice frame and combining the characteristics to obtain the voice characteristic vector corresponding to each voice frame.
In one embodiment, the device further comprises a data uploading module and a detection result receiving module; wherein: the data uploading module is used for uploading the voice data stream to the server when the current vehicle-mounted network condition is determined to meet the preset online detection condition, so that the server performs end point detection on the voice data stream through a voice end point detection model of the server; and the detection result receiving module is used for receiving a voice endpoint detection result corresponding to the voice data stream issued by the server.
In one embodiment, the voice training data includes noise-free voice data, in-vehicle environment voice data, indoor noise voice data, and abnormal voice data; the system also comprises a training preparation module and a model training module; wherein: the training preparation module is used for determining a training data set according to the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data according to a first proportion, and extracting data from the training data set according to a second proportion to obtain a verification data set; and the model training module is used for training the deep learning neural network through the training data set and the verification data set to obtain a trained voice endpoint detection model.
In one embodiment, the training preparation module comprises a training data determination module and a data labeling module; wherein: the training data determining module is used for determining training data from the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data according to the total training data and the first proportion; and the data labeling module is used for labeling each training data frame in the training data and constructing a training data set based on each labeled training data frame.
In one embodiment, the device further comprises a test data acquisition module, a test data processing module and a detection result acquisition module; wherein: the test data acquisition module is used for acquiring a test data set carrying an artificially labeled voice endpoint; the test data processing module is used for carrying out voice endpoint analysis on the test data in the test data set through the voice endpoint detection model to obtain a voice endpoint analysis result, and determining a test voice endpoint in the test data according to the voice endpoint analysis result; and the detection result obtaining module is used for comparing the manually marked voice endpoint of the test data with the test voice endpoint according to the detection accurate judgment condition, the detection loss judgment condition and the detection error judgment condition to obtain a model detection result.
In one embodiment, the vehicle-mounted terminal voice endpoint detection system further comprises a model updating module, wherein the model updating module is used for updating the voice endpoint detection model of the vehicle-mounted terminal when the current vehicle-mounted network condition is determined to meet the preset model updating condition.
FIG. 6 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 in fig. 1. As shown in fig. 6, the computer apparatus includes a processor, a memory, a network interface, an input device, a sound collection device, a speaker, and a display screen, which are connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and also stores a computer program, and when the computer program is executed by a processor, the processor can realize the voice endpoint detection method of the vehicle-mounted terminal. The internal memory may also store a computer program, and when the computer program is executed by the processor, the processor may execute the voice endpoint detection method of the vehicle-mounted terminal. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like. The sound collection device is used for acquiring a voice data stream to be detected, and specifically can be a microphone. The speaker may play voice data for voice interaction.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, the voice endpoint detection apparatus at the vehicle end provided by the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 6. The memory of the computer device may store various program modules constituting the voice endpoint detection apparatus of the vehicle-mounted terminal, such as a voice data acquisition module, a voice feature extraction module, an endpoint analysis module, and an endpoint determination module shown in fig. 5. The computer program constituted by the respective program modules causes the processor to execute the steps in the voice endpoint detection method of the vehicle-mounted terminal according to the respective embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 6 may perform acquiring the current vehicle-mounted network condition of the vehicle-mounted terminal and the voice data stream to be detected through a voice data acquisition module in the voice endpoint detection apparatus of the vehicle-mounted terminal shown in fig. 5. The computer equipment can extract the voice feature vector from the voice data stream when the current vehicle-mounted network condition is determined not to meet the preset online detection condition through the voice feature extraction module. The computer equipment can perform voice endpoint analysis on the voice characteristic vectors through a voice endpoint detection model of the vehicle-mounted end through an endpoint analysis module to obtain a voice endpoint analysis result, and the voice endpoint detection model is obtained through training based on voice training data obtained in different environments. The computer equipment can determine the voice endpoint in the voice data stream according to the voice endpoint analysis result through the endpoint determination module to obtain a voice endpoint detection result.
In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the voice endpoint detection method of the vehicle-mounted terminal. Here, the steps of the voice endpoint detection method of the vehicle-mounted terminal may be the steps in the voice endpoint detection method of the vehicle-mounted terminal of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, which stores a computer program, and when the computer program is executed by a processor, the processor is caused to execute the steps of the voice endpoint detection method of the vehicle-mounted terminal. Here, the steps of the voice endpoint detection method of the vehicle-mounted terminal may be the steps in the voice endpoint detection method of the vehicle-mounted terminal of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (16)

1. A voice endpoint detection method of a vehicle-mounted end is characterized by comprising the following steps:
acquiring the current vehicle-mounted network condition of a vehicle-mounted end and a voice data stream to be detected;
when the current vehicle network speed included in the current vehicle-mounted network condition is determined not to exceed a preset network speed threshold value, combining to obtain a voice feature vector corresponding to each voice frame according to the feature of each voice frame in the voice data stream and the feature of upper and lower related frames of each voice frame; the dimensionality of the voice feature vector is the sum of the dimensionality of the feature of the corresponding voice frame and the dimensionality of the feature of the upper and lower associated frames;
performing voice endpoint analysis on the voice feature vector through a voice endpoint detection model of the vehicle-mounted end to obtain a voice endpoint analysis result, wherein the voice endpoint detection model is obtained by training based on voice training data obtained in different environments; the voice training data comprises noise-free voice data, vehicle-mounted environment voice data, indoor noise voice data and abnormal voice data which are in a preset number ratio; the training step of the voice endpoint detection model comprises the following steps: performing voice endpoint analysis on the voice training data through a voice endpoint detection model to be trained, performing parameter adjustment on the voice endpoint detection model to be trained according to an obtained training data analysis result and a labeling result of the voice training data, and then continuing training, and obtaining a trained voice endpoint detection model when a training end condition is met; performing model detection on the trained voice endpoint detection model based on a detection accurate judgment condition, a detection loss judgment condition and a detection error judgment condition, and obtaining the voice endpoint detection model after the detection is passed;
determining a voice endpoint in the voice data stream according to the voice endpoint analysis result to obtain a voice endpoint detection result;
and when the current vehicle-mounted network condition is determined to meet the preset model updating condition, updating the voice endpoint detection model of the vehicle-mounted terminal.
2. The method according to claim 1, wherein when it is determined that a current vehicle network speed included in the current vehicle-mounted network condition does not exceed a preset network speed threshold, combining to obtain a speech feature vector corresponding to each speech frame according to features of each speech frame and upper and lower associated frames of each speech frame in the speech data stream, comprises:
comparing the current vehicle network speed included in the current vehicle-mounted network condition with a preset network speed threshold value;
when the current vehicle network speed does not exceed the network speed threshold value, framing the voice data stream to obtain each voice frame;
and extracting the characteristics of each voice frame and the upper and lower related frames of the voice frame, and combining to obtain a voice characteristic vector corresponding to each voice frame.
3. The method according to claim 1, after the obtaining of the current vehicle-mounted network condition of the vehicle-mounted terminal and the voice data stream to be detected, further comprising:
when the current vehicle-mounted network condition is determined to meet a preset online detection condition, uploading the voice data stream to a server, so that the server performs end point detection on the voice data stream through a voice end point detection model of a server end;
and receiving a voice endpoint detection result corresponding to the voice data stream issued by the server.
4. The method of claim 1, wherein the speech endpoint detection model is obtained by:
determining a training data set according to the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data according to a first proportion, and extracting data from the training data set according to a second proportion to obtain a verification data set;
and training the deep learning neural network through the training data set and the verification data set to obtain a trained voice endpoint detection model.
5. The method of claim 4, wherein determining a training data set based on the noise-free speech data, the on-board ambient speech data, the indoor noise speech data, and the abnormal speech data according to the first scale comprises:
determining training data from the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data and the abnormal voice data according to a total training data amount and a first proportion;
and marking each training data frame in the training data, and constructing a training data set based on each marked training data frame.
6. The method of claim 4, wherein after obtaining the trained speech endpoint detection model, further comprising:
acquiring a test data set carrying an artificially labeled voice endpoint;
performing voice endpoint analysis on the test data in the test data set through the voice endpoint detection model to obtain a voice endpoint analysis result, and determining a test voice endpoint in the test data according to the voice endpoint analysis result;
and comparing the manually marked voice endpoint of the test data with the test voice endpoint according to the detection accurate judgment condition, the detection loss judgment condition and the detection error judgment condition to obtain a model detection result.
7. The method according to any one of claims 1 to 6, wherein when it is determined that the current vehicle-mounted network condition satisfies a preset model update condition, updating the voice endpoint detection model of the vehicle-mounted terminal comprises:
when the current vehicle-mounted network condition is determined to meet a preset model updating condition, sending a model updating request to a server;
and updating the voice endpoint detection model of the vehicle-mounted end according to model updating data returned by the server based on the model updating request.
8. An apparatus for detecting a voice endpoint at a vehicle-mounted terminal, the apparatus comprising:
a voice data acquisition module, configured to acquire the current vehicle-mounted network condition of the vehicle-mounted terminal and the voice data stream to be detected;
a voice feature extraction module, configured to, when it is determined that the current vehicle network speed included in the current vehicle-mounted network condition does not exceed a preset network speed threshold, combine the features of each voice frame in the voice data stream with the features of the preceding and following associated frames of that voice frame to obtain a voice feature vector corresponding to each voice frame; the dimensionality of the voice feature vector is the sum of the dimensionality of the features of the corresponding voice frame and the dimensionality of the features of its preceding and following associated frames;
an endpoint analysis module, configured to perform voice endpoint analysis on the voice feature vector through a voice endpoint detection model at the vehicle-mounted end to obtain a voice endpoint analysis result, the voice endpoint detection model being trained on voice training data acquired in different environments; the voice training data comprises noise-free voice data, vehicle-mounted environment voice data, indoor noise voice data, and abnormal voice data in a preset proportion; the training of the voice endpoint detection model comprises: performing voice endpoint analysis on the voice training data through a voice endpoint detection model to be trained, adjusting the parameters of the model to be trained according to the resulting training data analysis result and the labeling result of the voice training data, continuing training, and obtaining a trained voice endpoint detection model when a training end condition is met; and performing model testing on the trained voice endpoint detection model based on an accurate-detection judgment condition, a missed-detection judgment condition, and a false-detection judgment condition, the voice endpoint detection model being obtained after the testing passes;
an endpoint determining module, configured to determine a voice endpoint in the voice data stream according to the voice endpoint analysis result to obtain a voice endpoint detection result;
and a model updating module, configured to update the voice endpoint detection model of the vehicle-mounted terminal when it is determined that the current vehicle-mounted network condition meets a preset model updating condition.
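The feature-combination step of claim 8 (stacking each frame's features with those of its neighboring frames, so the combined dimensionality is the sum of the parts) can be sketched as follows. This is a minimal illustrative implementation, not the patent's actual code; the context width and edge-padding strategy are assumptions.

```python
import numpy as np

def build_feature_vectors(frame_features, context=2):
    """Stack each frame's features with those of its preceding and
    following associated frames. `frame_features` has shape
    (num_frames, dim); the context width of 2 is a hypothetical choice."""
    num_frames, dim = frame_features.shape
    # Pad at both ends so edge frames still get a full context window.
    padded = np.pad(frame_features, ((context, context), (0, 0)), mode="edge")
    vectors = []
    for i in range(num_frames):
        # Window of 2*context+1 frames centered on frame i, flattened,
        # so the output dimensionality is dim * (2*context + 1).
        window = padded[i : i + 2 * context + 1]
        vectors.append(window.reshape(-1))
    return np.array(vectors)
```

The resulting vector dimensionality equals the frame's own feature dimensionality plus that of its context frames, matching the dimensionality statement in claim 8.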
9. The apparatus of claim 8, wherein the speech feature extraction module comprises:
a network comparison module, configured to compare the current vehicle network speed included in the current vehicle-mounted network condition with a preset network speed threshold;
a voice framing module, configured to frame the voice data stream to obtain each voice frame when the current vehicle network speed does not exceed the network speed threshold;
and a frame feature extraction module, configured to extract the features of each voice frame and of its preceding and following associated frames, and combine them to obtain the voice feature vector corresponding to each voice frame.
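The framing step in claim 9 splits the stream into overlapping frames before feature extraction. A minimal sketch follows; the 400-sample frame length and 160-sample hop (25 ms / 10 ms at 16 kHz) are hypothetical defaults, not values from the patent.

```python
import numpy as np

def frame_stream(samples, frame_len=400, hop=160):
    """Split a 1-D audio sample stream into overlapping voice frames.
    Trailing samples that do not fill a whole frame are dropped."""
    n = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.stack([samples[i * hop : i * hop + frame_len] for i in range(n)])
```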
10. The apparatus of claim 8, further comprising:
a data uploading module, configured to upload the voice data stream to a server when it is determined that the current vehicle-mounted network condition meets a preset online detection condition, so that the server performs endpoint detection on the voice data stream through a voice endpoint detection model at the server end;
and a detection result receiving module, configured to receive the voice endpoint detection result for the voice data stream returned by the server.
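Claims 8 and 10 together imply a routing decision: run on-device detection when the network speed does not exceed the threshold, and upload to the server otherwise. A sketch of that dispatch, with a hypothetical threshold and interface:

```python
def choose_detection_path(network_speed_mbps, speed_threshold,
                          detect_on_device, detect_on_server):
    """Route endpoint detection by network condition. Treating
    'speed exceeds threshold' as the online detection condition is an
    assumption; the patent leaves the condition preset."""
    if network_speed_mbps > speed_threshold:
        return detect_on_server   # claim 10: server-side endpoint detection
    return detect_on_device       # claim 8: on-device endpoint detection
```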
11. The apparatus of claim 8, further comprising:
a training preparation module, configured to construct a training data set from the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data, and the abnormal voice data according to a first proportion, and to extract data from the training data set according to a second proportion to obtain a validation data set;
and a model training module, configured to train a deep learning neural network with the training data set and the validation data set to obtain a trained voice endpoint detection model.
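Claim 11's model training module fits a frame classifier on the training set while monitoring the validation set. The patent specifies a deep learning neural network; as a stand-in, the sketch below trains a single-layer logistic classifier (speech vs. non-speech per frame) with gradient-descent parameter adjustment, purely to illustrate the train/validate loop.

```python
import numpy as np

def train_frame_classifier(X, y, X_val, y_val, epochs=50, lr=0.1, seed=0):
    """Illustrative stand-in for the claimed deep-network training:
    logistic regression over frame feature vectors. X: (n, dim) features,
    y: 0/1 frame labels. Returns weights, bias, and validation accuracy."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0, 0.01, X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid output per frame
        grad_w = X.T @ (p - y) / len(y)          # cross-entropy gradient
        grad_b = float(np.mean(p - y))
        w -= lr * grad_w                          # parameter adjustment step
        b -= lr * grad_b
    val_pred = (1.0 / (1.0 + np.exp(-(X_val @ w + b)))) > 0.5
    val_acc = float(np.mean(val_pred == y_val))   # monitored on validation set
    return w, b, val_acc
```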
12. The apparatus of claim 11, wherein the training preparation module comprises:
a training data determining module, configured to determine training data from the noise-free voice data, the vehicle-mounted environment voice data, the indoor noise voice data, and the abnormal voice data according to a total training data amount and the first proportion;
and a data labeling module, configured to label each training data frame in the training data and construct a training data set based on the labeled training data frames.
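Claims 11 and 12 describe drawing training data from the four environment pools according to a first proportion and holding out a second proportion for validation. A sketch of that preparation step; the concrete proportion values here are placeholders, as the patent only states that the proportions are preset.

```python
import random

def build_datasets(clean, vehicle, indoor, abnormal,
                   total=10000, first_proportion=(0.25, 0.35, 0.25, 0.15),
                   second_proportion=0.1, seed=0):
    """Compose a training set from the four labeled pools per the first
    proportion, then split off the second proportion as a validation set."""
    rng = random.Random(seed)
    pools = [clean, vehicle, indoor, abnormal]
    training = []
    for pool, p in zip(pools, first_proportion):
        k = int(total * p)                 # share of this environment's data
        training.extend(rng.sample(pool, k))
    rng.shuffle(training)
    n_val = int(len(training) * second_proportion)
    return training[n_val:], training[:n_val]   # (training, validation)
```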
13. The apparatus of claim 11, further comprising:
a test data acquisition module, configured to acquire a test data set carrying manually labeled voice endpoints;
a test data processing module, configured to perform voice endpoint analysis on the test data in the test data set through the voice endpoint detection model to obtain a voice endpoint analysis result, and determine test voice endpoints in the test data according to the voice endpoint analysis result;
and a detection result obtaining module, configured to compare the manually labeled voice endpoints of the test data with the test voice endpoints according to an accurate-detection judgment condition, a missed-detection judgment condition, and a false-detection judgment condition to obtain a model test result.
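The comparison in claim 13 can be sketched as matching predicted endpoints against manually labeled ones under the three judgment conditions. The time-tolerance matching rule below is an assumed interpretation; the patent does not specify the exact conditions.

```python
def evaluate_endpoints(predicted, labeled, tolerance=0.2):
    """Compare predicted endpoint times against manual labels.
    A prediction within `tolerance` seconds of an unmatched label counts
    as an accurate detection; unmatched labels are missed detections;
    unmatched predictions are false detections."""
    matched = set()
    accurate = 0
    for p in predicted:
        for i, t in enumerate(labeled):
            if i not in matched and abs(p - t) <= tolerance:
                matched.add(i)
                accurate += 1
                break
    missed = len(labeled) - len(matched)
    false = len(predicted) - accurate
    return {"accurate": accurate, "missed": missed, "false": false}
```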
14. The apparatus of any one of claims 8 to 13,
the model updating module is further configured to send a model update request to a server when it is determined that the current vehicle-mounted network condition meets the preset model updating condition, and to update the voice endpoint detection model at the vehicle-mounted end according to model update data returned by the server in response to the model update request.
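The request/response update flow of claim 14 can be sketched as below. The server interface and model store are hypothetical stand-ins for whatever transport and storage the vehicle-mounted end actually uses.

```python
def update_model(network_condition_ok, server, local_model_store):
    """When the vehicle-mounted network meets the preset update condition,
    send a model update request and apply the returned update data to the
    on-device endpoint detection model; otherwise keep the current model."""
    if not network_condition_ok:
        return False                                  # condition not met
    update_data = server.handle_update_request()      # model update request
    local_model_store["vad_model"] = update_data      # apply returned update
    return True
```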
15. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
16. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
CN201910740881.3A 2019-08-12 2019-08-12 Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment Active CN110428854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910740881.3A CN110428854B (en) 2019-08-12 2019-08-12 Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment


Publications (2)

Publication Number Publication Date
CN110428854A CN110428854A (en) 2019-11-08
CN110428854B true CN110428854B (en) 2022-05-06

Family

ID=68415616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910740881.3A Active CN110428854B (en) 2019-08-12 2019-08-12 Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment

Country Status (1)

Country Link
CN (1) CN110428854B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992967A (en) * 2019-12-27 2020-04-10 苏州思必驰信息科技有限公司 Voice signal processing method and device, hearing aid and storage medium
CN111341351B (en) * 2020-02-25 2023-05-23 厦门亿联网络技术股份有限公司 Voice activity detection method, device and storage medium based on self-attention mechanism
CN111862951B (en) * 2020-07-23 2024-01-26 海尔优家智能科技(北京)有限公司 Voice endpoint detection method and device, storage medium and electronic equipment
CN113268497A (en) * 2020-12-15 2021-08-17 龚文凯 Intelligent recognition learning training method and device for key target parts
CN116935836A (en) * 2022-03-29 2023-10-24 华为技术有限公司 Voice endpoint detection method, device, equipment and storage medium
CN117558284A (en) * 2023-12-26 2024-02-13 中邮消费金融有限公司 Voice enhancement method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050049207A (en) * 2003-11-21 2005-05-25 Electronics and Telecommunications Research Institute Dialogue-type continuous speech recognition system and speech endpoint detection method using it
CN103730119A (en) * 2013-12-18 2014-04-16 惠州市车仆电子科技有限公司 Vehicle-mounted human-machine voice interaction system
CN108010515A (en) * 2017-11-21 2018-05-08 清华大学 Speech endpoint detection and wake-up method and device
CN108877778A (en) * 2018-06-13 2018-11-23 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device



Similar Documents

Publication Publication Date Title
CN110428854B (en) Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
EP3955246B1 (en) Voiceprint recognition method and device based on memory bottleneck feature
US9940935B2 (en) Method and device for voiceprint recognition
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
CN108281137A (en) Universal voice wake-up recognition method and system under a whole-phoneme framework
US11670299B2 (en) Wakeword and acoustic event detection
CN108364662B (en) Voice emotion recognition method and system based on paired identification tasks
WO2014114116A1 (en) Method and system for voiceprint recognition
US20210304774A1 (en) Voice profile updating
US11132990B1 (en) Wakeword and acoustic event detection
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN111243569B (en) Emotional voice automatic generation method and device based on generative adversarial network
Gemmeke et al. Sparse imputation for large vocabulary noise robust ASR
CN112599127B (en) Voice instruction processing method, device, equipment and storage medium
US11200884B1 (en) Voice profile updating
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
Mahesha et al. Support vector machine-based stuttering dysfluency classification using GMM supervectors
CN111261145B (en) Voice processing device, equipment and training method thereof
CN113851136A (en) Clustering-based speaker recognition method, device, equipment and storage medium
US11763806B1 (en) Speaker recognition adaptation
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
Tawaqal et al. Recognizing five major dialects in Indonesia based on MFCC and DRNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant