CN112863496A - Voice endpoint detection method and device - Google Patents


Info

Publication number: CN112863496A (granted as CN112863496B)
Application number: CN201911181820.4A
Authority: CN (China)
Inventor: 袁斌
Assignee (original and current): Alibaba Group Holding Ltd
Original language: Chinese (zh)
Legal status: Active (application granted)
Prior art keywords: voice, target, end point, rear end, speech

Classifications

    • G10L 15/142 — Speech recognition; classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/1822 — Speech classification or search using natural language modelling; parsing for meaning understanding
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 19/008 — Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy
    • G10L 25/87 — Detection of discrete points within a voice signal
    • G10L 2015/225 — Feedback of the input speech

Abstract

The application discloses a voice endpoint detection method and device. The method comprises the following steps: obtaining target voice data; obtaining an intermediate voice recognition result generated after a voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; and judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result. With this method, the voice rear end point detection parameters can be adjusted in real time based on the intermediate voice recognition result output during voice recognition, dynamic detection of the voice rear end point is realized, and the limitation caused by the existing voice endpoint detection process depending too much on the voice recognition result is avoided.

Description

Voice endpoint detection method and device
Technical Field
The application relates to the technical field of computers, in particular to a voice endpoint detection method. The application also relates to a voice endpoint detection device and an electronic device.
Background
A voice endpoint is the critical point at which the signal changes between silence and effective voice. Voice endpoint detection (Voice Activity Detection, VAD), also called voice boundary detection, aims to identify and eliminate long silence periods in a voice signal and to determine the start point and end point of voice; whether endpoint detection is correct has a great influence on voice recognition performance. Especially in human-computer interaction scenarios, the endpoint detection effect directly affects user experience.
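To make the silence/voice framing that underlies endpoint detection concrete, it can be sketched as a minimal energy-threshold detector. This is a simplified illustration only, not the method claimed by the application; the frame length, threshold, and hangover values are arbitrary choices for the sketch:

```python
def detect_endpoints(samples, frame_len=160, energy_thresh=0.01, hangover=20):
    """Return (start_frame, end_frame) of the first voice region, or None.

    A frame counts as 'voice' when its mean energy exceeds energy_thresh;
    the rear end point is declared only after `hangover` consecutive silent
    frames, so short pauses inside an utterance are not taken as the tail.
    """
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        flags.append(energy > energy_thresh)

    start = end = None
    silence_run = 0
    for idx, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = idx          # front end point: first voiced frame
            silence_run = 0
            end = idx                # candidate rear end point
        elif start is not None:
            silence_run += 1
            if silence_run >= hangover:
                break                # enough trailing silence: tail confirmed
    return (start, end) if start is not None else None
```

The hangover counter is exactly the "silence detection time" parameter that the method of this application adjusts dynamically.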
For example, in voice learning software, endpoint detection runs while the user records an evaluation, and recording stops automatically when the end of the voice is detected, sparing the user the extra step of clicking a stop button and improving the experience. As another example, some recording scenarios require the user to read the complete text before recording stops; if the user pauses for a long time after reading half of the text, existing endpoint detection may treat the early silence as a voice endpoint and stop recording, which fails the predetermined recording requirement and degrades the user experience.
The existing voice endpoint detection method is mainly realized based on a voice recognition decoder: voice endpoint detection is carried out while the input voice data is recognized. In this implementation, however, the detection effect depends too much on the recognition result. For example, when a token of the voice recognition decoder reaches the mark identifying the end of voice, a corresponding recognition result must already exist, because the endpoint judgment is made from the backtracking information carried on the state node corresponding to that mark; if the decoder has no recognition result, no endpoint judgment can be made. The voice endpoint detection process is therefore limited by its over-dependence on the voice recognition result.
Disclosure of Invention
The embodiment of the application provides a voice endpoint detection method, aiming to solve the problem that the existing voice endpoint detection process is limited by over-dependence on the voice recognition result. The application further provides a voice endpoint detection device and an electronic device.
The embodiment of the application provides a voice endpoint detection method, which comprises the following steps:
obtaining target voice data;
obtaining an intermediate voice recognition result generated after a voice recognition decoder carries out voice recognition on the target voice data;
decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence;
adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters;
and judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result.
Optionally, the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes:
mapping a context-dependent modeling unit in a speech recognition decoder to a context-independent modeling unit;
establishing a voice rear endpoint distinguishing model based on the context-irrelevant modeling unit;
and identifying the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence.
Optionally, the modeling unit is a phoneme, and the mapping the context-dependent modeling unit in the speech recognition decoder to the context-independent modeling unit includes:
obtaining a left-related phoneme and a right-related phoneme of each target phoneme in the speech recognition decoder;
obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left relevant phoneme and a probability value of each state output observation sequence corresponding to the left relevant phoneme, and obtaining a state transition probability value of the right relevant phoneme and a probability value of each state output observation sequence corresponding to the right relevant phoneme;
averaging the state transition probability value of the target phoneme, the state transition probability value of the left related phoneme and the state transition probability value of the right related phoneme to obtain a state transition probability average value; calculating the average value of the probability value of each state output observation sequence corresponding to the target phoneme, the probability value of each state output observation sequence corresponding to the left related phoneme and the probability value of each state output observation sequence corresponding to the right related phoneme to obtain the probability average value of the state output observation sequences;
and determining the state transition probability mean value as a target state transition probability value of the target phoneme, and determining the probability mean value of the state output observation sequence as a probability value of the target state output observation sequence of the target phoneme.
Optionally, the recognizing the target speech data based on the speech rear end point discrimination model to obtain a target speech unit sequence includes: obtaining a target observation sequence of the target voice data; sequentially calculating the probability value of the output target observation sequence in a recursion manner according to the sequence of the target observation sequence appearing in the voice rear end point distinguishing model; decoding the target voice data by using the voice rear endpoint distinguishing model by adopting a token passing algorithm to obtain a target state path corresponding to the maximum probability value of an output target observation sequence; and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
Optionally, the decoding, by using a token passing algorithm, the target voice data using the voice rear endpoint discrimination model includes: preprocessing the target voice data to obtain audio frames; extracting features of the audio frames to obtain target audio features; and inputting the target audio features into the voice rear endpoint discrimination model and decoding with a token passing algorithm.
Optionally, the adjusting, according to the intermediate speech recognition result, the speech rear end point detection parameter to obtain the target detection parameter includes: adjusting the mute detection time of the voice rear end point according to the intermediate voice recognition result to obtain target mute detection time;
correspondingly, the judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result includes: and judging the rear end point of the target voice unit sequence according to the target silence detection time to obtain a voice rear end point judgment result.
Optionally, the adjusting, according to the intermediate speech recognition result, the silence detection time of the speech rear end point to obtain a target silence detection time includes: and if the intermediate voice recognition result is not changed in a first preset time period, shortening the mute detection time of the voice rear end point, and obtaining the target mute detection time for judging the voice rear end point.
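The adjustment rule above can be sketched as follows. The class name, the two timeout values, and the stable-period parameter are hypothetical: the application only states that an intermediate result unchanged within a first preset time period shortens the silence detection time used to judge the voice rear end point.

```python
class TailPointTimer:
    """Adjust the rear-end-point silence timeout from intermediate ASR results."""

    def __init__(self, base_timeout=0.8, short_timeout=0.3, stable_period=1.0):
        self.base_timeout = base_timeout    # default silence detection time (s)
        self.short_timeout = short_timeout  # shortened silence detection time (s)
        self.stable_period = stable_period  # "first preset time period" (s)
        self._last_result = None
        self._last_change = 0.0

    def update(self, now, intermediate_result):
        """Feed the latest intermediate result; return the timeout to use now."""
        if intermediate_result != self._last_result:
            self._last_result = intermediate_result
            self._last_change = now
        if now - self._last_change >= self.stable_period:
            # Text has been stable: the speaker likely finished, so declare
            # the rear end point after a shorter stretch of silence.
            return self.short_timeout
        return self.base_timeout
```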
Optionally, the method further includes: if the intermediate voice recognition result is not changed within a second preset time period, performing semantic recognition on target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point;
the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence comprises the following steps: and if the target semantic recognition result is not matched with the target semantic information for judging the voice rear end point, decoding the target voice data based on the voice rear end point judging model to obtain a target voice unit sequence.
Optionally, the method further includes: and if the target semantic recognition result is matched with the preset target semantic information for judging the voice rear end point, determining that the current time point is the voice rear end point of the target voice data.
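The semantic matching step above can be illustrated with a deliberately simple stand-in. The application does not specify how the target semantic recognition result is matched against the preset target semantic information, so suffix matching against preset terminal phrases is used here purely for illustration:

```python
def matches_tail_semantics(recognized_text, terminal_phrases):
    """Return True if the recognized text matches preset end-of-utterance
    semantics (toy rule: the text ends with one of the preset phrases)."""
    text = recognized_text.strip().lower()
    return any(text.endswith(p) for p in terminal_phrases)
```

When this returns True, the current time point is taken as the voice rear end point; otherwise the discrimination-model decoding path is taken, as described above.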
Optionally, the obtaining the target voice data includes: if the voice input end of the target voice data has not detected the voice rear end point of the target voice data, receiving the target voice data sent by the voice input end.
Optionally, the method further includes: and if the voice rear end point judgment result shows that the voice rear end point of the target voice data is detected, outputting identification information corresponding to the voice rear end point.
Optionally, the method further includes: outputting voice rear end point approval information, wherein the voice rear end point approval information is used for a user to confirm whether the voice rear end point is a real voice rear end point; and obtaining feedback information of the user aiming at the voice post-endpoint approval information.
The embodiment of the present application further provides a voice endpoint detection method, including:
after a client fails to detect the voice rear end point of obtained target voice data, receiving the target voice data sent by the client; obtaining an intermediate voice recognition result generated after a voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result; and sending the voice rear end point judgment result to the client.
Optionally, the sending the voice rear end point determination result to the client includes: and if the voice rear end point judgment result shows that the voice rear end point of the target voice data is detected, sending the voice rear end point judgment result to the client.
An embodiment of the present application further provides a voice endpoint detection system, including: the system comprises a first voice endpoint detection module, a semantic detection module and a second voice endpoint detection module;
the first voice endpoint detection module is used for performing voice rear end point detection on target voice data through a voice recognition decoder, and for sending the target voice data to the semantic detection module when the voice rear end point of the target voice data is not detected;
the semantic detection module is used for carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point; if the target semantic recognition result is matched with preset target semantic information for judging a rear voice endpoint, determining that the current time point is the rear voice endpoint of the target voice data; if the target semantic recognition result is not matched with the preset target semantic information for judging the voice rear end point, the target voice data is sent to the second voice end point detection module;
the second voice endpoint detection module is used for obtaining an intermediate voice recognition result generated after the voice recognition decoder carries out voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; and judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result.
Optionally, the first voice endpoint detection module is disposed at the client, and the semantic detection module and the second voice endpoint detection module are disposed at the server.
An embodiment of the present application further provides a voice endpoint detection system, including: the system comprises a target voice data distribution module, a first voice endpoint detection module, a semantic detection module, a second voice endpoint detection module and a voice rear endpoint confirmation module;
the target voice data distribution module is used for distributing target voice data to the first voice endpoint detection module, the semantic detection module and the second voice endpoint detection module;
the first voice endpoint detection module is used for carrying out voice rear endpoint detection on the target voice data through a voice recognition decoder to obtain a first voice rear endpoint judgment result;
the semantic detection module is used for carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point or not to obtain a semantic matching result; obtaining a second voice rear end point judgment result according to the semantic matching result;
the second voice endpoint detection module is used for obtaining an intermediate voice recognition result generated after the voice recognition decoder carries out voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a third voice rear end point judgment result;
the voice rear end point confirmation module is used for confirming a target voice rear end point judgment result aiming at the target voice data according to at least two judgment results of the first voice rear end point judgment result, the second voice rear end point judgment result and the third voice rear end point judgment result.
Optionally, the determining a target speech rear end point discrimination result for the target speech data according to at least two discrimination results of the first speech rear end point discrimination result, the second speech rear end point discrimination result, and the third speech rear end point discrimination result includes: and determining the firstly obtained discrimination result as a target voice rear end point discrimination result aiming at the target voice data according to the time sequence of at least two discrimination results of the first voice rear end point discrimination result, the second voice rear end point discrimination result or the third voice rear end point discrimination result.
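The first-result-wins arbitration described above can be sketched as follows; the timestamp representation and result encoding are hypothetical, since the application only specifies that the earliest of at least two discrimination results is taken as the target result:

```python
def arbitrate(results):
    """Pick the target voice rear end point discrimination result.

    `results` holds (timestamp, decision) pairs from the decoder-based,
    semantic, and discrimination-model detectors; per the scheme above,
    the earliest-arriving decision wins, and at least two are required.
    """
    if len(results) < 2:
        raise ValueError("at least two discrimination results are required")
    return min(results, key=lambda r: r[0])[1]
```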
An embodiment of the present application further provides a voice rear end point detection device, including:
a target voice data obtaining unit for obtaining target voice data;
an intermediate speech recognition result obtaining unit, configured to obtain an intermediate speech recognition result generated by a speech recognition decoder performing speech recognition on the target speech data;
a target voice unit sequence obtaining unit, configured to decode the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence;
a target detection parameter obtaining unit, configured to adjust a voice rear end point detection parameter according to the intermediate voice recognition result to obtain a target detection parameter;
and the voice rear end point judging unit is used for judging the voice rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
An embodiment of the present application further provides an electronic device, including: a processor and a memory for storing a voice rear end point detection program which, when read and executed by the processor, performs the following operations: obtaining target voice data; obtaining an intermediate voice recognition result generated after a voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; and judging the voice rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result.
Compared with the prior art, the embodiment of the application has the following advantages:
the voice endpoint detection method provided by the embodiment of the application decodes the target voice data by using the voice rear endpoint discrimination model to obtain a target voice unit sequence, adjusting the voice rear end point detection parameters according to the intermediate voice recognition result, and judging the rear end point of the target voice unit sequence according to the adjusted detection parameters to obtain a voice rear end point judgment result, the method synchronously carries out the voice rear end point detection process and the voice recognition process, carries out decoupling processing on the voice end point detection process and the voice recognition process through an end point detection link provided by the method, combines the voice rear end point detection process with the current voice application scene, adjusts the voice rear end point detection parameters in real time based on the intermediate voice recognition result output in the voice recognition process, realizes dynamic detection of the voice rear end point, and avoids the problem that the existing voice end point detection process has limitation due to over dependence on the voice recognition result.
Drawings
Fig. 1 is a flowchart of a voice endpoint detection method according to a first embodiment of the present application;
fig. 2 is a flowchart of a voice endpoint detection method according to a second embodiment of the present application;
FIG. 3 is a schematic diagram of a voice endpoint detection system according to a third embodiment of the present application;
FIG. 4 is a schematic diagram of a voice endpoint detection system according to a fourth embodiment of the present application;
fig. 5 is a block diagram of a voice endpoint detection apparatus according to a fifth embodiment of the present application;
fig. 6 is a schematic logical structure diagram of an electronic device according to a sixth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art may make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
Aiming at a voice endpoint detection scene, in order to improve the applicability of voice endpoint detection, the application provides a voice endpoint detection method, a voice endpoint detection device corresponding to the method and electronic equipment. The following provides embodiments to explain the method, apparatus, and electronic device in detail.
A first embodiment of the present application provides a voice endpoint detection method. The execution body of the method may be a computing device application for performing voice endpoint detection. Fig. 1 is a flowchart of the voice endpoint detection method provided in the first embodiment, and the method is described in detail below with reference to Fig. 1. The following description illustrates the principles of the method and is not intended to limit its actual use.
As shown in fig. 1, the voice endpoint detection method provided in this embodiment includes the following steps:
s101, obtaining target voice data.
The target voice data refers to voice data to be subjected to voice rear end point detection, such as navigation voice of a driver in a vehicle-mounted environment.
The target voice data may be obtained by receiving it from the voice input end after the voice input end has failed to detect the rear end point of the obtained target voice data. The voice input end may be an intelligent mobile terminal used by a user, or a terminal arranged locally to the speaking user, such as a vehicle-mounted navigation device.
S102, obtaining an intermediate voice recognition result generated after the voice recognition decoder carries out voice recognition on the target voice data.
In the present embodiment, the process of performing voice recognition on the target voice data is carried out in synchronization with the process of voice endpoint detection. This relies on a voice recognition decoder: only after the intermediate voice recognition result of the decoder is obtained can rear end point detection of the target voice data be carried out. In this embodiment, the intermediate voice recognition result is preferably the number of silent frames of the intermediate voice.
S103, decoding the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence.
The implementation of this step differs from the decoding process of the existing voice recognition decoder: it simplifies that process. In this step, the target voice data is decoded based on a voice rear end point discrimination model whose modeling unit is obtained by mapping the modeling unit of the voice recognition decoder, that is, a context-dependent modeling unit is mapped to a context-independent modeling unit. Since the number of state nodes corresponding to context-independent modeling units is far smaller than that corresponding to context-dependent modeling units, the mapping reduces the number of state nodes of the model and thereby the amount of calculation during decoding. The reason this is acceptable is that, compared with voice recognition, voice endpoint detection places a lower accuracy requirement on the recognition result: it only needs to distinguish voice from silence and has no requirement on the specific content of the voice, so the path search can be relatively simple.
In this embodiment, the process of decoding the target speech data based on the speech rear-end point discrimination model to obtain the target speech unit sequence may include the following steps:
firstly, mapping a context-dependent modeling unit in the speech recognition decoder into a context-independent modeling unit;
secondly, obtaining the voice rear end point distinguishing model based on the context-irrelevant modeling unit;
and finally, identifying the target voice data based on the voice rear end point distinguishing model to obtain a target voice unit sequence.
The foregoing mapping of the context-dependent modeling unit in the voice recognition decoder to a context-independent modeling unit refers to a simplified mapping of the modeling unit in the acoustic model of the decoder, so as to reduce the number of state nodes in the model. The acoustic model is preferably a hidden Markov model (HMM). In this embodiment, the context-dependent modeling unit in the voice recognition decoder may be a phoneme, and correspondingly the context-independent modeling unit after the simplified mapping is also a phoneme. The process specifically includes the following steps:
firstly, obtaining a left relevant phoneme and a right relevant phoneme of each target phoneme in the speech recognition decoder; here, the left related phoneme and the right related phoneme are both context related phonemes of the target phoneme. For example, the acoustic model of the speech recognition decoder includes 26 phonemes (modeling units), and the 26 phonemes are all the target phonemes.
Secondly, obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left relevant phoneme and a probability value of each state output observation sequence corresponding to the left relevant phoneme, and obtaining a state transition probability value of the right relevant phoneme and a probability value of each state output observation sequence corresponding to the right relevant phoneme;
then, averaging the state transition probability value of the target phoneme, the state transition probability value of the left related phoneme and the state transition probability value of the right related phoneme to obtain a state transition probability average value, and averaging the probability value of each state output observation sequence corresponding to the target phoneme, the probability value of each state output observation sequence corresponding to the left related phoneme and the probability value of each state output observation sequence corresponding to the right related phoneme to obtain a probability average value of the state output observation sequences;
and finally, determining the state transition probability mean value as a target state transition probability value of the target phoneme, and determining the probability mean value of the state output observation sequence as a probability value of the target state output observation sequence of the target phoneme.
After the mapping, each modeling unit corresponds to only one state node; that is, a phoneme serves as a node of the acoustic model of the voice rear end point discrimination model, and the target state transition probability value of the target phoneme and the probability value of the target state output observation sequence are the probability values associated with that network node. For example, after the 26 phonemes (modeling units) are mapped, the obtained voice rear end point discrimination model is a hidden Markov model (HMM) containing 26 state nodes.
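The averaging described in the steps above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the data layout (each phoneme variant held as a dict with a state transition matrix `trans` and per-state observation output probabilities `emit`) is an assumption made for the example.

```python
import numpy as np

def merge_phoneme_hmm(target, left_related, right_related):
    """Collapse a context-dependent target phoneme into a single
    context-independent HMM node, per the steps above: the state
    transition probability values and the per-state observation output
    probability values of the target phoneme and of its left/right
    related phonemes are averaged element-wise."""
    trans_mean = np.mean(
        [target["trans"], left_related["trans"], right_related["trans"]], axis=0)
    emit_mean = np.mean(
        [target["emit"], left_related["emit"], right_related["emit"]], axis=0)
    # The means become the target phoneme's target state transition
    # probability value and target observation output probability value.
    return {"trans": trans_mean, "emit": emit_mean}
```

Applied to all 26 target phonemes, this yields one state node per phoneme, giving the 26-node discrimination model described above.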
In this embodiment, the identifying the target speech data based on the speech rear-end point discrimination model to obtain a target speech unit sequence may specifically be: obtaining a target observation sequence corresponding to the target voice data; sequentially calculating the probability value of the output target observation sequence in a recursion manner according to the sequence of the target observation sequence appearing in the voice rear end point distinguishing model; decoding the target voice data by using the voice rear endpoint distinguishing model by adopting a token passing algorithm to obtain a target state path corresponding to the maximum probability value of an output target observation sequence; and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
The process of decoding the target speech data by using the speech rear endpoint discrimination model by using the token passing algorithm specifically includes the following steps:
Firstly, preprocessing the target voice data to obtain audio frames. The process specifically includes: converting the analog speech signal into a digital signal by A/D conversion, applying pre-emphasis to the digital signal to raise its high-frequency portion, and then performing framing and windowing to divide the non-stationary speech signal data points into short-time signals in units of audio frames.
Secondly, extracting features from the audio frames to obtain target audio features. In this embodiment, the audio features are MFCC speech features, and the feature extraction process maps the spectrum of the sound signal onto the Mel frequency scale.
And finally, inputting the target audio features into the voice rear end point discrimination model and decoding with the token passing algorithm. For each state node in the voice rear end point discrimination model, the maximum-probability path is recorded in the corresponding model node, and the incoming-arc node of that maximum-probability path is recorded in a model node variable. When passing terminates, the maximum likelihood value (probability value) with which each state path generates the observation sequence of the target voice data is computed from the backtracking information stored in the tokens; the greater the log-likelihood value, the more likely the path is the preferred path. The log-likelihood value of a path equals the sum of the log-likelihood values (transition probability values) of all the transition arcs the path passes through, plus the log probability densities (observation sequence output probability values) of the observation sequences associated with the states of all the HMM nodes it passes through. Decoding with the token passing algorithm reduces redundancy in the search network of the voice rear end point discrimination model, greatly lowers the space complexity and the cost of computing resources, and improves the decoding efficiency of the voice rear end point discrimination model.
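The token passing decoding described above can be illustrated with a minimal Viterbi-style sketch over a toy model. The data layout (transition log-probabilities in a dict keyed by arc, observation log-probabilities as a callback) is an assumption for illustration, not the patent's implementation.

```python
import math

def token_passing_decode(states, log_trans, log_obs, observations):
    """Return (best state path, its log-likelihood): the sum of the
    transition log-probabilities along the path plus the observation
    log-probabilities of the states it visits, maximised by keeping
    one best token per state node at every step."""
    # One token per state node: (accumulated log-likelihood, state path).
    tokens = {s: (0.0, [s]) for s in states}
    for obs in observations:
        new_tokens = {}
        for j in states:
            # Pass each token along its outgoing arcs into node j and keep
            # only the best-scoring arriving token; its back-path is the
            # maximum-probability path recorded at node j.
            new_tokens[j] = max(
                ((ll + log_trans.get((i, j), -math.inf) + log_obs(j, obs),
                  path + [j])
                 for i, (ll, path) in tokens.items()),
                key=lambda t: t[0])
        tokens = new_tokens
    # The surviving token with the greatest log-likelihood traces the
    # target state path via its stored backtracking information.
    best_ll, best_path = max(tokens.values(), key=lambda t: t[0])
    return best_path, best_ll
```

Because only one token survives per state node at each step, the search space stays proportional to the (small) number of context-independent nodes, which is the efficiency gain the mapping is designed to deliver.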
The execution order of steps S102 and S103 is not limited; that is, the intermediate voice recognition result generated by the speech recognition decoder performing voice recognition on the target voice data may also be obtained after the target voice data has been decoded based on the voice rear end point discrimination model to obtain the target voice unit sequence.
And S104, adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters.
After the intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data and the target voice unit sequence obtained after the target voice data is decoded based on the voice rear end point distinguishing model are obtained in the above steps, the step is used for adjusting the voice rear end point detection parameter corresponding to the target voice unit sequence according to the intermediate voice recognition result to obtain the target detection parameter.
In this embodiment, adjusting the rear end point detection parameter according to the intermediate speech recognition result to obtain the target detection parameter may specifically refer to adjusting the silence detection time of the voice rear end point according to the intermediate voice recognition result to obtain the target silence detection time. For example, if the intermediate voice recognition result does not change within a first predetermined time period, the silence detection time of the voice rear end point is shortened to obtain the target silence detection time for judging the voice rear end point. For example, the preset silence detection time of the voice rear end point is 800 ms, and when the intermediate voice recognition result does not change within 300 ms, the silence detection time of the voice rear end point is shortened to 600 ms. Because the silence detection time of the voice rear end point is adjusted according to the intermediate voice recognition result, the voice rear end point detection process can be combined in real time with the scene of the current voice, realizing dynamic detection of the voice rear end point. For example, user A habitually pauses for a while in the middle of speaking; therefore, when voice rear end point detection is performed on the voice information of user A, the silence detection time can be adjusted in real time according to the pause time in the middle of speaking, and if the pause time is long, the silence detection time is shortened so that the voice rear end point can be determined quickly.
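The 800 ms / 300 ms / 600 ms example can be captured in a small helper. This is an illustrative sketch; the function name and the convention of passing in how long the intermediate result has stayed unchanged are assumptions, not part of the patent.

```python
def target_silence_detection_time(result_unchanged_ms,
                                  preset_ms=800,
                                  first_period_ms=300,
                                  shortened_ms=600):
    """Return the target silence detection time for judging the voice
    rear end point.  While the intermediate recognition result keeps
    changing (unchanged time below the first predetermined period), the
    preset value is used; once the result has been stable for the first
    predetermined period, the detection time is shortened."""
    if result_unchanged_ms >= first_period_ms:
        return shortened_ms
    return preset_ms
```

When the intermediate result changes again, the caller would reset `result_unchanged_ms` to zero, so the preset 800 ms is restored, matching the restore behaviour described in this embodiment.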
It should be noted that, if the intermediate speech recognition result changes within the interval from 300 ms to 600 ms, the silence detection time of the voice rear end point can be restored to 800 ms, for the following reason: a change of the intermediate voice recognition result within the 300 ms to 600 ms interval indicates that the voice input has not finished, and restoring the silence detection time of the voice rear end point to the preset silence detection time avoids determining the voice rear end point too early. In this embodiment, after the intermediate speech recognition result generated by the speech recognition decoder performing speech recognition on the target speech data is obtained, the method further includes the following:
A. and if the intermediate voice recognition result is not changed in a second preset time period, performing semantic recognition on the target voice data to obtain a target semantic recognition result. Specifically, semantic analysis is performed on the result recognized by the speech recognition decoder, and specifically, semantic analysis may be performed by a Natural Language Understanding (NLU) technology. The second predetermined period of time is used here to distinguish from the first predetermined period of time described above, and is used to indicate that the function of the intermediate speech recognition result is different in the two scenarios.
In this embodiment, semantic recognition is performed on the target speech data only after the intermediate speech recognition result has remained unchanged for the second predetermined time period, rather than parsing the semantics frame by frame, so as to save computation. When the intermediate speech recognition result of the speech recognition decoder has not changed within the predetermined time limit (the result at this time can be considered a stable result), the recognition result of the speech recognition decoder is semantically parsed through a Natural Language Understanding (NLU) technique.
B. Judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point; for example, the target semantic information for discriminating the voice back end point may be a navigation command word.
C. If the target semantic recognition result does not match with the predetermined target semantic information for distinguishing the voice back end point, the operation of step S103 is executed, that is, the target voice data is decoded based on the voice back end point distinguishing model, so as to obtain a target voice unit sequence.
D. And if the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point, determining that the current time point is the voice rear end point of the target voice data. For example, for a preset navigation command word, if the semantic meaning of the navigation command word is detected, the operation of "decoding the target voice data based on the voice back end point discrimination model to obtain the target voice unit sequence" may not be performed, and the voice back end point is directly determined, so as to reduce the delay in voice interaction. For the vehicle-mounted scene, after a user speaks an offline command word, the user only needs to wait whether a predefined command word is hit by the semantic analysis module, and if the predefined command word is hit, the user immediately returns a voice end point event.
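Steps A to D can be sketched as a simple gate, under stated assumptions: a stable recognition text, a predefined command-word list standing in for the target semantic information, and a `decode_fn` callback standing in for the decoding of step S103. Matching by substring is a simplification of real NLU parsing.

```python
def semantic_rear_end_point_gate(stable_text, command_words, decode_fn):
    """If the semantic recognition result matches predetermined target
    semantic information (e.g. a navigation command word), the current
    time point is the voice rear end point and decoding is skipped to
    reduce interaction delay; otherwise fall back to decoding with the
    rear end point discrimination model (represented by decode_fn)."""
    if any(word in stable_text for word in command_words):
        return {"rear_end_point": True, "units": None}
    return {"rear_end_point": False, "units": decode_fn(stable_text)}
```

In the vehicle-mounted scene this is the path that lets a hit on a predefined offline command word return a voice end point event immediately.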
And S105, judging the voice rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result.
After the target detection parameters are obtained in the above steps, the step is used for judging the voice back end point of the target voice unit sequence according to the target detection parameters, and obtaining a voice back end point judgment result. For example, the voice rear end point of the target voice unit sequence is judged according to the target silence detection time, and when the time corresponding to the number of the silence frames of the target voice unit sequence reaches the target silence detection time, the current time point is determined as the voice rear end point of the target voice unit sequence.
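The judgment in this step can be sketched by counting trailing silence units in the decoded target voice unit sequence. The silence label "sil" and the per-frame duration parameter are illustrative assumptions.

```python
def judge_rear_end_point(unit_sequence, frame_ms, target_silence_ms,
                         silence_unit="sil"):
    """Return True when the time corresponding to the number of trailing
    silence frames in the target voice unit sequence reaches the target
    silence detection time, i.e. the current time point is the voice
    rear end point."""
    trailing_silence = 0
    for unit in reversed(unit_sequence):
        if unit != silence_unit:
            break
        trailing_silence += 1
    return trailing_silence * frame_ms >= target_silence_ms
```

With the shortened 600 ms target time from the previous step, the rear end point is reported 200 ms earlier than with the 800 ms preset, which is the latency gain of the dynamic adjustment.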
In this embodiment, if the voice rear end point judgment result indicates that the voice rear end point of the target voice data has been detected, identification information corresponding to the voice rear end point is output, for example, by displaying a punctuation mark at the corresponding position of a text output interface corresponding to the target voice data and using the punctuation mark to break the sentence in the target voice data, so that the voice rear end point is displayed visually. Meanwhile, voice rear end point approval information can also be output, which lets the user confirm whether the visually displayed voice rear end point is the real voice rear end point, and feedback information of the user for the voice rear end point approval information is obtained. For example, the text output interface shows a prompt such as "Is the current position the voice rear end point?", displays a corresponding voice rear end point confirmation control and a non-confirmation control, and, after detecting the user's touch operation on the confirmation control or the non-confirmation control, obtains the user's feedback information for the voice rear end point approval information. The feedback information can serve as one of the learning indexes for adjusting the voice rear end point detection parameters, so as to optimize the voice rear end point judgment result.
It should be noted that the application subject of the voice endpoint detection method provided in this embodiment may be a client or a server; the client may be a smart phone or a vehicle-mounted navigation device used by a user. When the application subject is a server, the obtaining of the target speech data in step S101 may refer to: receiving the target voice data sent by the voice input end (client) after the voice input end fails to detect the voice rear end point of the target voice data. For example, the voice detection capability of a mobile phone or a vehicle-mounted navigation device is limited by its hardware, and its voice endpoint detection effect is poor; therefore, when the mobile phone or vehicle-mounted navigation device does not detect the voice rear end point of the target voice data, it sends the target voice data to the application subject end of this embodiment (which may be a server), and the application subject end performs voice endpoint detection using the acoustic models it maintains (the speech recognition decoder and the voice rear end point discrimination model) and its superior hardware resources, thereby achieving a good voice endpoint detection effect.
The voice endpoint detection method provided by this embodiment decodes the target voice data with the voice rear end point discrimination model to obtain the target voice unit sequence, adjusts the voice rear end point detection parameters according to the intermediate voice recognition result, and judges the rear end point of the target voice unit sequence according to the adjusted detection parameters to obtain the voice rear end point judgment result. The method runs the voice rear end point detection process and the voice recognition process synchronously, and decouples the two through the independently running endpoint detection link it provides. It combines the voice rear end point detection process with the current voice application scene, adjusts the voice rear end point detection parameters in real time based on the intermediate voice recognition results output during voice recognition, realizes dynamic detection of the voice rear end point, and avoids the limitation of existing voice endpoint detection processes caused by over-reliance on the voice recognition result.
Moreover, compared with voice recognition, the process of performing voice endpoint detection on target voice data imposes a lower precision requirement on the recognition result: the voice endpoint detection process only needs to distinguish sound from silence and places no requirement on the specific content of the voice, so the path search link in the voice endpoint detection process can be relatively simple. On this basis, the modeling unit of the voice rear end point discrimination model in this embodiment is a decoding unit obtained by mapping the modeling units of the speech recognition decoder, that is, context-dependent modeling units are mapped into context-independent modeling units. The number of state nodes corresponding to context-independent modeling units is far smaller than that corresponding to context-dependent modeling units, so the mapping process reduces the number of state nodes in the voice rear end point discrimination model and thereby reduces the computational load of the decoding process.
A second embodiment of the present application further provides a voice endpoint detection method, where the application subject of the method is a server configured with a voice recognition decoder and a voice rear end point discrimination model. As shown in fig. 2, the method specifically includes the following steps:
S201, receiving the target voice data sent by the client, for example, receiving the target voice data sent by the client after the client does not detect the voice rear end point of the target voice data;
S202, obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
S203, decoding the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence;
S204, adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters;
S205, judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result;
and S206, sending the voice rear end point judgment result to the client. For example, if the voice rear-end point determination result indicates that the voice rear-end point of the target voice data has been detected, the voice rear-end point determination result is sent to the client.
The voice endpoint detection method provided by this embodiment realizes voice endpoint detection of target voice data through cooperation between the client and the server. The server decodes the target voice data with the voice rear end point discrimination model to obtain the target voice unit sequence, adjusts the voice rear end point detection parameters according to the intermediate voice recognition result, and judges the rear end point of the target voice unit sequence according to the adjusted detection parameters to obtain the voice rear end point judgment result. The method runs the voice rear end point detection process and the voice recognition process synchronously, and decouples the two through the independently running endpoint detection link it provides. It combines the voice rear end point detection process with the current voice application scene, adjusts the voice rear end point detection parameters in real time based on the intermediate voice recognition results output during voice recognition, realizes dynamic detection of the voice rear end point, and avoids the limitation of existing voice endpoint detection processes caused by over-reliance on the voice recognition result.
A third embodiment of the present application provides a voice endpoint detection system, as shown in fig. 3, the system includes: a first voice endpoint detection module 301, a semantic detection module 302 and a second voice endpoint detection module 303;
The first voice endpoint detection module 301 is configured to perform voice rear end point detection on target voice data through a voice recognition decoder, and to send the target voice data to the semantic detection module 302 when no voice rear end point of the target voice data is detected. For example, the voice recognition decoder performs voice recognition and voice endpoint detection on the input target voice data; when a token of the voice recognition decoder moves to a tag identifying voice termination, a corresponding recognition result is obtained, that is, voice endpoint detection is performed according to the trace-back information carried on the state node corresponding to the tag, and the target voice data is sent to the semantic detection module 302 when no recognition result is obtained.
The semantic detection module 302 is configured to perform semantic recognition on the target speech data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point; if the target semantic recognition result is matched with preset target semantic information for judging a rear voice endpoint, determining that the current time point is the rear voice endpoint of the target voice data; and if the target semantic recognition result is not matched with the preset target semantic information for judging the voice rear end point, sending the target voice data to the second voice end point detection module 303.
The second voice endpoint detection module 303 is configured to obtain an intermediate voice recognition result generated after a voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; and judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result. Please refer to the related contents of the first embodiment of the present application, which are not described herein again.
In this embodiment, the first voice endpoint detection module may be disposed at a client, and the semantic detection module and the second voice endpoint detection module may be disposed at a server.
The voice endpoint detection system provided by the embodiment can sequentially perform voice endpoint detection on target voice data by using a plurality of endpoint detection methods, so that the reliability of the voice endpoint detection process is increased.
A fourth embodiment of the present application additionally provides a voice endpoint detection system, as shown in fig. 4, the system including: a target voice data distribution module 401, a first voice endpoint detection module 402, a semantic detection module 403, a second voice endpoint detection module 404 and a voice rear endpoint confirmation module 405;
the target voice data distribution module 401 is configured to distribute the target voice data to the first voice endpoint detection module 402, the semantic detection module 403, and the second voice endpoint detection module 404;
the first speech endpoint detection module 402 is configured to perform speech endpoint detection on the target speech data through the speech recognition decoder to obtain a first speech endpoint determination result.
The semantic detection module 403 is configured to perform semantic recognition on the target speech data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point or not to obtain a semantic matching result; and obtaining a second voice rear end point judgment result according to the semantic matching result.
The second voice endpoint detection module 404 is configured to obtain an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence; adjusting the endpoint detection parameters after the voice according to the intermediate voice recognition result to obtain target detection parameters; and judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a third voice rear end point judgment result.
The voice back end point confirmation module 405 is configured to confirm the target voice back end point judgment result for the target voice data according to at least two judgment results of the first voice back end point judgment result, the second voice back end point judgment result, and the third voice back end point judgment result. For example, according to the time sequence of at least two of the first speech rear end point discrimination result, the second speech rear end point discrimination result, or the third speech rear end point discrimination result, the discrimination result obtained first is determined as the target speech rear end point discrimination result for the target speech data, or according to a predetermined priority sequence, the discrimination result with the highest degree of reliability is selected from the at least two discrimination results obtained at the same time as the target speech rear end point discrimination result for the target speech data.
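The confirmation logic of module 405 can be sketched as follows. The tuple layout and the particular priority ordering are illustrative assumptions; the patent only requires that the earliest judgment win, with a predetermined priority order breaking ties.

```python
def confirm_target_rear_end_point(results,
                                  priority=("semantic", "decoder", "model")):
    """results: (source, timestamp_ms, detected) tuples from at least
    two of the three detection modules.  The judgment obtained first is
    taken as the target voice rear end point judgment; when two arrive
    at the same time, the source ranked higher in the assumed priority
    order (highest reliability first) wins.  Returns None when no
    module detected a rear end point."""
    positives = [r for r in results if r[2]]
    if not positives:
        return None
    # Sort key: earliest timestamp first, then predetermined priority.
    return min(positives, key=lambda r: (r[1], priority.index(r[0])))
```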
The voice endpoint detection system provided by this embodiment can perform voice endpoint detection on target voice data by using multiple endpoint detection methods at the same time to obtain multiple voice rear endpoint determination results, and determine the final target voice rear endpoint determination result with the highest reliability based on the predetermined voice rear endpoint determination rule, thereby improving the voice rear endpoint detection efficiency and increasing the reliability of the voice endpoint detection process.
The fifth embodiment of the present application further provides a voice endpoint detection apparatus, since the apparatus embodiment is substantially similar to the method embodiment, and therefore the description is relatively simple, and the details of the related technical features may be found in the corresponding description of the method embodiment provided above, and the following description of the apparatus embodiment is only illustrative.
Referring to fig. 5 for an understanding of the embodiment, fig. 5 is a block diagram of the units of the apparatus provided in this embodiment. As shown in fig. 5, the apparatus provided in this embodiment includes:
a target voice data obtaining unit 501 for obtaining target voice data;
an intermediate speech recognition result obtaining unit 502, configured to obtain an intermediate speech recognition result generated after the speech recognition decoder performs speech recognition on the target speech data;
a target speech unit sequence obtaining unit 503, configured to decode the target speech data based on the speech rear end point discrimination model to obtain a target speech unit sequence;
a target detection parameter obtaining unit 504, configured to adjust a voice rear end point detection parameter according to the intermediate voice recognition result, and obtain a target detection parameter;
and a voice rear end point distinguishing unit 505, configured to distinguish, according to the target detection parameter, a voice rear end point of the target voice unit sequence, and obtain a voice rear end point distinguishing result.
Optionally, the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes:
mapping a context-dependent modeling unit in a speech recognition decoder to a context-independent modeling unit;
establishing a voice rear endpoint distinguishing model based on the context-irrelevant modeling unit;
and identifying the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence.
Optionally, the modeling unit is a phoneme, and the mapping the context-dependent modeling unit in the speech recognition decoder to the context-independent modeling unit includes:
obtaining a left-related phoneme and a right-related phoneme of each target phoneme in the speech recognition decoder;
obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left relevant phoneme and a probability value of each state output observation sequence corresponding to the left relevant phoneme, and obtaining a state transition probability value of the right relevant phoneme and a probability value of each state output observation sequence corresponding to the right relevant phoneme;
averaging the state transition probability value of the target phoneme, the state transition probability value of the left related phoneme and the state transition probability value of the right related phoneme to obtain a state transition probability average value; calculating the average value of the probability value of each state output observation sequence corresponding to the target phoneme, the probability value of each state output observation sequence corresponding to the left related phoneme and the probability value of each state output observation sequence corresponding to the right related phoneme to obtain the probability average value of the state output observation sequences;
and determining the state transition probability mean value as a target state transition probability value of the target phoneme, and determining the probability mean value of the state output observation sequence as a probability value of the target state output observation sequence of the target phoneme.
Optionally, the recognizing the target speech data based on the speech rear end point discrimination model to obtain a target speech unit sequence includes: obtaining a target observation sequence of the target voice data; sequentially calculating the probability value of the output target observation sequence in a recursion manner according to the sequence of the target observation sequence appearing in the voice rear end point distinguishing model; decoding the target voice data by using the voice rear endpoint distinguishing model by adopting a token passing algorithm to obtain a target state path corresponding to the maximum probability value of an output target observation sequence; and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
Optionally, the decoding, by using a token passing algorithm, the target speech data using the speech rear endpoint recognition model includes: preprocessing the target voice data to obtain an audio frame; extracting the characteristics of the audio frame to obtain target audio characteristics; and inputting the target audio features into the voice rear endpoint distinguishing model, and decoding by adopting a token passing algorithm.
Optionally, the adjusting, according to the intermediate speech recognition result, the speech rear end point detection parameter to obtain the target detection parameter includes: adjusting the mute detection time of the voice rear end point according to the intermediate voice recognition result to obtain target mute detection time;
correspondingly, the judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result includes: and judging the rear end point of the target voice unit sequence according to the target silence detection time to obtain a voice rear end point judgment result.
Optionally, the adjusting, according to the intermediate speech recognition result, the silence detection time of the speech rear end point to obtain a target silence detection time includes: and if the intermediate voice recognition result is not changed in a first preset time period, shortening the mute detection time of the voice rear end point, and obtaining the target mute detection time for judging the voice rear end point.
Optionally, the method further includes: if the intermediate voice recognition result is not changed within a second preset time period, performing semantic recognition on target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point;
the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence comprises the following steps: and if the target semantic recognition result is not matched with the target semantic information for judging the voice rear end point, decoding the target voice data based on the voice rear end point judging model to obtain a target voice unit sequence.
Optionally, the method further includes: and if the target semantic recognition result is matched with the preset target semantic information for judging the voice rear end point, determining that the current time point is the voice rear end point of the target voice data.
Optionally, the obtaining the target voice data includes: if the voice input end does not detect the voice rear end point of the target voice data, receiving the target voice data sent by the voice input end.
Optionally, the method further includes: and if the voice rear end point judgment result shows that the voice rear end point of the target voice data is detected, outputting identification information corresponding to the voice rear end point.
Optionally, the method further includes: outputting voice rear end point approval information, where the voice rear end point approval information is used for a user to confirm whether the voice rear end point is a real voice rear end point; and obtaining feedback information of the user for the voice rear end point approval information.
The above embodiments provide a voice endpoint detection method and a voice endpoint detection apparatus. In addition, a sixth embodiment of the present application provides an electronic device. Since this embodiment is basically similar to the method embodiment, it is described relatively simply; for details of the related technical features, refer to the corresponding description of the method embodiment provided above. The following description of the electronic device embodiment is only illustrative. The electronic device embodiment is as follows:
Please refer to fig. 6 for an understanding of this embodiment; fig. 6 is a schematic diagram of an electronic device provided in this embodiment.
As shown in fig. 6, the electronic device includes: a processor 601 and a memory 602;
the memory 602 is used for storing a program for voice endpoint detection, and when the program is read and executed by the processor, the program performs the following operations:
obtaining target voice data;
obtaining an intermediate voice recognition result generated after a voice recognition decoder carries out voice recognition on the target voice data;
decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence;
adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters;
and judging the voice rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judgment result.
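The five operations above can be illustrated with a short fragment. The sketch below covers only the final judgment step (judging the rear end point on a decoded voice unit sequence); the `"sil"` unit label and the 10 ms unit step are assumptions for illustration, not details given in this application.

```python
FRAME_MS = 10  # assumed time step between decoded voice units


def trailing_silence_ms(units):
    """Length, in ms, of the run of silence units at the end of the sequence."""
    run = 0
    for u in reversed(units):
        if u != "sil":
            break
        run += 1
    return run * FRAME_MS


def judge_rear_endpoint(units, silence_ms):
    """True once trailing silence meets the (possibly adjusted) threshold."""
    return trailing_silence_ms(units) >= silence_ms


# A sequence ending in 600 ms of silence passes an 800 ms threshold only
# after the threshold has been shortened by the parameter-adjustment step.
units = ["n", "i", "h", "ao"] + ["sil"] * 60
print(judge_rear_endpoint(units, 800))  # False
print(judge_rear_endpoint(units, 400))  # True
```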
Optionally, the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes:
mapping a context-dependent modeling unit in a speech recognition decoder to a context-independent modeling unit;
establishing a voice rear endpoint distinguishing model based on the context-irrelevant modeling unit;
and identifying the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence.
Optionally, the modeling unit is a phoneme, and the mapping the context-dependent modeling unit in the speech recognition decoder to the context-independent modeling unit includes:
obtaining a left-related phoneme and a right-related phoneme of each target phoneme in the speech recognition decoder;
obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left relevant phoneme and a probability value of each state output observation sequence corresponding to the left relevant phoneme, and obtaining a state transition probability value of the right relevant phoneme and a probability value of each state output observation sequence corresponding to the right relevant phoneme;
averaging the state transition probability value of the target phoneme, the state transition probability value of the left related phoneme and the state transition probability value of the right related phoneme to obtain a state transition probability average value; calculating the average value of the probability value of each state output observation sequence corresponding to the target phoneme, the probability value of each state output observation sequence corresponding to the left related phoneme and the probability value of each state output observation sequence corresponding to the right related phoneme to obtain the probability average value of the state output observation sequences;
and determining the state transition probability mean value as a target state transition probability value of the target phoneme, and determining the probability mean value of the state output observation sequence as a probability value of the target state output observation sequence of the target phoneme.
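The parameter averaging described above can be sketched directly. In the fragment below, the dict layout and the 3-state, 4-observation shapes are assumptions for illustration; the passage above only specifies that the target, left-context, and right-context parameters are averaged.

```python
import numpy as np


def average_hmm(center, left, right):
    """Build a context-independent model for a target phoneme by averaging
    the state transition probabilities and the state output (observation)
    probabilities of the centre, left-context, and right-context models."""
    return {
        "trans": (center["trans"] + left["trans"] + right["trans"]) / 3.0,
        "emit": (center["emit"] + left["emit"] + right["emit"]) / 3.0,
    }


# toy 3-state models for one target phoneme and its two context variants
center = {"trans": np.full((3, 3), 0.2), "emit": np.full((3, 4), 0.25)}
left = {"trans": np.full((3, 3), 0.5), "emit": np.full((3, 4), 0.25)}
right = {"trans": np.full((3, 3), 0.2), "emit": np.full((3, 4), 0.25)}
mono = average_hmm(center, left, right)
```

The averaged `mono` model is what the rear endpoint discrimination model would use in place of the three context-dependent variants.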
Optionally, the recognizing the target speech data based on the speech rear end point discrimination model to obtain a target speech unit sequence includes: obtaining a target observation sequence of the target voice data; sequentially calculating the probability value of the output target observation sequence in a recursion manner according to the sequence of the target observation sequence appearing in the voice rear end point distinguishing model; decoding the target voice data by using the voice rear endpoint distinguishing model by adopting a token passing algorithm to obtain a target state path corresponding to the maximum probability value of an output target observation sequence; and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
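The recursion above is the classic token-passing (Viterbi) computation: each HMM state carries a token holding the best log-probability of reaching it so far, tokens are propagated frame by frame, and the back-trace from the best final token yields the target state path. The toy 2-state model below is an assumption for illustration, not the patent's discrimination model.

```python
import math


def viterbi(obs, log_trans, log_emit, log_init):
    """Return the most likely state path for a sequence of observation ids."""
    n = len(log_init)
    # tokens for the first frame
    score = [log_init[s] + log_emit[s][obs[0]] for s in range(n)]
    back = []
    for o in obs[1:]:
        prev, ptr, score = score, [0] * n, [0.0] * n
        for s in range(n):
            # pass the best incoming token to state s
            p = max(range(n), key=lambda q: prev[q] + log_trans[q][s])
            ptr[s] = p
            score[s] = prev[p] + log_trans[p][s] + log_emit[s][o]
        back.append(ptr)
    # back-trace from the best final token
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# state 0 favours observation 0, state 1 favours observation 1
LT = [[math.log(0.7), math.log(0.3)], [math.log(0.3), math.log(0.7)]]
LE = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
LI = [math.log(0.9), math.log(0.1)]
print(viterbi([0, 0, 1, 1], LT, LE, LI))  # [0, 0, 1, 1]
```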
Optionally, the decoding, by using a token passing algorithm, the target speech data using the speech rear endpoint discrimination model includes: preprocessing the target voice data to obtain audio frames; extracting features of the audio frames to obtain target audio features; and inputting the target audio features into the voice rear endpoint discrimination model and decoding by using the token passing algorithm.
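The preprocessing and feature extraction steps above amount to framing the waveform and reducing each frame to a feature vector. A hedged sketch follows: the 25 ms frame length, 10 ms hop at 16 kHz, and log energy as a stand-in feature are assumptions, since the application does not specify them.

```python
import numpy as np


def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames (25 ms / 10 ms @ 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])


def log_energy(frames, eps=1e-10):
    """One log-energy value per frame, standing in for real acoustic features."""
    return np.log(np.sum(frames ** 2, axis=1) + eps)


frames = frame_signal(np.ones(800))
print(frames.shape)  # (3, 400)
```

The resulting per-frame features are what would be fed to the rear endpoint discrimination model for token-passing decoding.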
Optionally, the adjusting, according to the intermediate speech recognition result, the speech rear end point detection parameter to obtain the target detection parameter includes: adjusting the mute detection time of the voice rear end point according to the intermediate voice recognition result to obtain target mute detection time;
correspondingly, the judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result includes: and judging the rear end point of the target voice unit sequence according to the target silence detection time to obtain a voice rear end point judgment result.
Optionally, the adjusting, according to the intermediate speech recognition result, the silence detection time of the speech rear end point to obtain a target silence detection time includes: and if the intermediate voice recognition result is not changed in a first preset time period, shortening the mute detection time of the voice rear end point, and obtaining the target mute detection time for judging the voice rear end point.
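The adjustment above reduces to a small rule. In the sketch below, the 500 ms first preset period and the halving factor are invented for illustration; the application does not fix these values.

```python
def adjust_silence_ms(base_ms, unchanged_ms, first_preset_ms=500, factor=0.5):
    """Return the target silence detection time for rear-endpoint judgment."""
    if unchanged_ms >= first_preset_ms:
        # intermediate recognition result stable: the speaker has likely
        # finished, so shorten the silence window to respond sooner
        return int(base_ms * factor)
    return base_ms


print(adjust_silence_ms(800, 600))  # 400
print(adjust_silence_ms(800, 100))  # 800
```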
Optionally, the method further includes: if the intermediate voice recognition result is not changed within a second preset time period, performing semantic recognition on target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point;
the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence comprises the following steps: and if the target semantic recognition result is not matched with the target semantic information for judging the voice rear end point, decoding the target voice data based on the voice rear end point judging model to obtain a target voice unit sequence.
Optionally, the method further includes: and if the target semantic recognition result is matched with the preset target semantic information for judging the voice rear end point, determining that the current time point is the voice rear end point of the target voice data.
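The semantic branch above can be sketched as a simple match: if the recognition text matches preset target semantic information for judging the rear end point, the current time point is taken as the endpoint; otherwise the caller falls back to decoding with the discrimination model. The example patterns are invented, not from the patent.

```python
# invented examples of "complete utterance" semantics
COMPLETE_UTTERANCES = {"turn off the light", "play music", "stop"}


def semantic_endpoint(recognized_text):
    """True if the text matches preset rear-endpoint semantic information."""
    return recognized_text.strip().lower() in COMPLETE_UTTERANCES
```

In the flow of the embodiment, a `True` here short-circuits the model-based judgment, while `False` routes the audio on to the rear endpoint discrimination model.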
Optionally, the obtaining the target voice data includes: if the voice input end does not detect the voice rear end point of the target voice data, receiving the target voice data sent by the voice input end.
Optionally, the method further includes: and if the voice rear end point judgment result shows that the voice rear end point of the target voice data is detected, outputting identification information corresponding to the voice rear end point.
Optionally, the method further includes: outputting voice rear end point approval information, where the voice rear end point approval information is used for a user to confirm whether the voice rear end point is a real voice rear end point; and obtaining feedback information of the user for the voice rear end point approval information.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
Although the present application has been described with reference to preferred embodiments, these embodiments are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the appended claims.

Claims (21)

1. A method for voice endpoint detection, comprising:
obtaining target voice data;
obtaining an intermediate voice recognition result generated after a voice recognition decoder carries out voice recognition on the target voice data;
decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence;
adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters;
and judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result.
2. The method of claim 1, wherein the decoding the target speech data based on the speech back-end point discrimination model to obtain a target speech unit sequence comprises:
mapping a context-dependent modeling unit in a speech recognition decoder to a context-independent modeling unit;
establishing a voice rear endpoint distinguishing model based on the context-irrelevant modeling unit;
and identifying the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence.
3. The method of claim 2, wherein the modeling units are phonemes and wherein mapping the context-dependent modeling units in the speech recognition decoder to context-independent modeling units comprises:
obtaining a left-related phoneme and a right-related phoneme of each target phoneme in the speech recognition decoder;
obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left relevant phoneme and a probability value of each state output observation sequence corresponding to the left relevant phoneme, and obtaining a state transition probability value of the right relevant phoneme and a probability value of each state output observation sequence corresponding to the right relevant phoneme;
averaging the state transition probability value of the target phoneme, the state transition probability value of the left related phoneme and the state transition probability value of the right related phoneme to obtain a state transition probability average value; calculating the average value of the probability value of each state output observation sequence corresponding to the target phoneme, the probability value of each state output observation sequence corresponding to the left related phoneme and the probability value of each state output observation sequence corresponding to the right related phoneme to obtain the probability average value of the state output observation sequences;
and determining the state transition probability mean value as a target state transition probability value of the target phoneme, and determining the probability mean value of the state output observation sequence as a probability value of the target state output observation sequence of the target phoneme.
4. The method of claim 2, wherein the recognizing the target speech data based on the speech back-end point discrimination model to obtain a target speech unit sequence comprises:
obtaining a target observation sequence of the target voice data;
sequentially calculating the probability value of the output target observation sequence in a recursion manner according to the sequence of the target observation sequence appearing in the voice rear end point distinguishing model;
decoding the target voice data by using the voice rear endpoint distinguishing model by adopting a token passing algorithm to obtain a target state path corresponding to the maximum probability value of an output target observation sequence;
and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
5. The method of claim 4, wherein the decoding, by using a token passing algorithm, the target speech data using the speech rear endpoint discrimination model comprises:
preprocessing the target voice data to obtain an audio frame;
extracting the characteristics of the audio frame to obtain target audio characteristics;
and inputting the target audio features into the voice rear endpoint distinguishing model, and decoding by adopting a token passing algorithm.
6. The method according to claim 1, wherein the adjusting the post-speech endpoint detection parameters according to the intermediate speech recognition result to obtain target detection parameters comprises:
adjusting the mute detection time of the voice rear end point according to the intermediate voice recognition result to obtain target mute detection time;
correspondingly, the judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result includes:
and judging the rear end point of the target voice unit sequence according to the target silence detection time to obtain a voice rear end point judgment result.
7. The method according to claim 6, wherein the adjusting the silence detection time of the voice backend point according to the intermediate voice recognition result to obtain the target silence detection time comprises:
and if the intermediate voice recognition result is not changed in a first preset time period, shortening the mute detection time of the voice rear end point, and obtaining the target mute detection time for judging the voice rear end point.
8. The method of claim 1, further comprising: if the intermediate voice recognition result is not changed within a second preset time period, performing semantic recognition on target voice data to obtain a target semantic recognition result;
judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point;
the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence comprises the following steps:
and if the target semantic recognition result is not matched with the target semantic information for judging the voice rear end point, decoding the target voice data based on the voice rear end point judging model to obtain a target voice unit sequence.
9. The method of claim 8, further comprising: and if the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point, determining that the current time point is the voice rear end point of the target voice data.
10. The method of claim 1, wherein obtaining target speech data comprises:
and if the voice input end does not detect the voice rear end point of the target voice data, receiving the target voice data sent by the voice input end.
11. The method of claim 1, further comprising:
and if the voice rear end point judgment result shows that the voice rear end point of the target voice data is detected, outputting identification information corresponding to the voice rear end point.
12. The method of claim 11, further comprising:
outputting voice rear end point approval information, wherein the voice rear end point approval information is used for a user to confirm whether the voice rear end point is a real voice rear end point;
and obtaining feedback information of the user for the voice rear end point approval information.
13. A method for voice endpoint detection, comprising:
receiving the target voice data sent by a client;
obtaining an intermediate voice recognition result generated after a voice recognition decoder carries out voice recognition on the target voice data;
decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence;
adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters;
judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judgment result;
and sending the voice rear end point judgment result to the client.
14. The method of claim 13, wherein sending the post-speech endpoint determination result to the client comprises:
and if the voice rear end point judgment result shows that the voice rear end point of the target voice data is detected, sending the voice rear end point judgment result to the client.
15. A voice endpoint detection system, comprising: the system comprises a first voice endpoint detection module, a semantic detection module and a second voice endpoint detection module;
the first voice endpoint detection module is used for carrying out voice rear end point detection on target voice data through a voice recognition decoder, and sending the target voice data to the semantic detection module when the voice rear end point of the target voice data is not detected;
the semantic detection module is used for carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point; if the target semantic recognition result is matched with preset target semantic information for judging a rear voice endpoint, determining that the current time point is the rear voice endpoint of the target voice data; if the target semantic recognition result is not matched with the preset target semantic information for judging the voice rear end point, the target voice data is sent to the second voice end point detection module;
the second voice endpoint detection module is used for obtaining an intermediate voice recognition result generated after the voice recognition decoder carries out voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; and judging the rear end point of the target voice unit sequence according to the target detection parameters to obtain a voice rear end point judgment result.
16. The system of claim 15, wherein the first voice endpoint detection module is disposed at a client, and wherein the semantic detection module and the second voice endpoint detection module are disposed at a server.
17. A voice endpoint detection system, comprising: the system comprises a target voice data distribution module, a first voice endpoint detection module, a semantic detection module, a second voice endpoint detection module and a voice rear endpoint confirmation module;
the target voice data distribution module is used for distributing target voice data to the first voice endpoint detection module, the semantic detection module and the second voice endpoint detection module;
the first voice endpoint detection module is used for carrying out voice rear endpoint detection on the target voice data through a voice recognition decoder to obtain a first voice rear endpoint judgment result;
the semantic detection module is used for carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with preset target semantic information for judging the voice rear end point or not to obtain a semantic matching result; obtaining a second voice rear end point judgment result according to the semantic matching result;
the second voice endpoint detection module is used for obtaining an intermediate voice recognition result generated after the voice recognition decoder carries out voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a third voice rear end point judgment result;
the voice rear end point confirmation module is used for confirming a target voice rear end point judgment result aiming at the target voice data according to the first voice rear end point judgment result, the second voice rear end point judgment result and the third voice rear end point judgment result.
18. The system according to claim 17, wherein the confirming a target speech rear end point discrimination result for the target speech data based on the first speech rear end point discrimination result, the second speech rear end point discrimination result, and the third speech rear end point discrimination result comprises: and determining the firstly obtained discrimination result as a target voice rear end point discrimination result aiming at the target voice data according to the time sequence of at least two discrimination results of the first voice rear end point discrimination result, the second voice rear end point discrimination result or the third voice rear end point discrimination result.
19. The system according to claim 17, wherein the confirming a target speech rear end point discrimination result for the target speech data based on the first speech rear end point discrimination result, the second speech rear end point discrimination result, and the third speech rear end point discrimination result comprises:
and selecting the discrimination result with the highest reliability from at least two discrimination results obtained at the same time as a target voice rear end point discrimination result aiming at the target voice data according to a preset priority order.
20. An apparatus for detecting a rear end point of speech, comprising:
a target voice data obtaining unit for obtaining target voice data;
an intermediate speech recognition result obtaining unit, configured to obtain an intermediate speech recognition result generated by a speech recognition decoder performing speech recognition on the target speech data;
a target voice unit sequence obtaining unit, configured to decode the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence;
a target detection parameter obtaining unit, configured to adjust a voice rear end point detection parameter according to the intermediate voice recognition result, and obtain a target detection parameter;
and the voice rear end point judging unit is used for judging the voice rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
21. An electronic device, comprising:
a processor;
a memory for storing a voice rear end point detection program, wherein the program, when read and executed by the processor, performs the following operations:
obtaining target voice data;
obtaining an intermediate voice recognition result generated after a voice recognition decoder carries out voice recognition on the target voice data;
decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence;
adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters;
and judging the voice rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judgment result.
CN201911181820.4A 2019-11-27 2019-11-27 Voice endpoint detection method and device Active CN112863496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911181820.4A CN112863496B (en) 2019-11-27 2019-11-27 Voice endpoint detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911181820.4A CN112863496B (en) 2019-11-27 2019-11-27 Voice endpoint detection method and device

Publications (2)

Publication Number Publication Date
CN112863496A true CN112863496A (en) 2021-05-28
CN112863496B CN112863496B (en) 2024-04-02

Family

ID=75984670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911181820.4A Active CN112863496B (en) 2019-11-27 2019-11-27 Voice endpoint detection method and device

Country Status (1)

Country Link
CN (1) CN112863496B (en)


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08115093A (en) * 1994-10-18 1996-05-07 Kokusai Denshin Denwa Co Ltd <Kdd> Method and device for on-hook detection, and method and device for continuous voice recognition
JPH1063289A (en) * 1996-08-20 1998-03-06 Ricoh Co Ltd Device and method for voice recognition, and information storage medium
US20050149330A1 (en) * 2003-04-28 2005-07-07 Fujitsu Limited Speech synthesis system
KR20130101943A (en) * 2012-03-06 2013-09-16 삼성전자주식회사 Endpoints detection apparatus for sound source and method thereof
CN103886871A (en) * 2014-01-28 2014-06-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN105976812A (en) * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 Voice identification method and equipment thereof
US20160358598A1 (en) * 2015-06-07 2016-12-08 Apple Inc. Context-based endpoint detection
US20170069308A1 (en) * 2015-09-03 2017-03-09 Google Inc. Enhanced speech endpointing
CN107004407A (en) * 2015-09-03 2017-08-01 谷歌公司 Enhanced sound end is determined
CN107045870A (en) * 2017-05-23 2017-08-15 南京理工大学 A kind of the Method of Speech Endpoint Detection of feature based value coding
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment
US20180268812A1 (en) * 2017-03-14 2018-09-20 Google Inc. Query endpointing based on lip detection
US20180350395A1 (en) * 2017-06-06 2018-12-06 Google Llc End of query detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
裴春宝 (Pei Chunbao): "Research on Endpoint Detection Algorithms in Lhasa Tibetan Speech Recognition", Journal of Tibet University (Natural Science Edition), vol. 29, no. 1, pages 54-58 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314145A (en) * 2021-06-09 2021-08-27 Guangzhou Huya Information Technology Co Ltd Sample generation method, model training method, and mouth-shape driving method, apparatus, device and medium
CN113345473A (en) * 2021-06-24 2021-09-03 iFLYTEK Co Ltd Voice endpoint detection method and device, electronic equipment and storage medium
CN113345473B (en) * 2021-06-24 2024-02-13 University of Science and Technology of China Voice endpoint detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112863496B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US11182122B2 (en) Voice control of computing devices
US10884701B2 (en) Voice enabling applications
US11087739B1 (en) On-device learning in a hybrid speech processing system
CN106683677B (en) Voice recognition method and device
CN106328127B (en) Speech recognition apparatus, speech recognition method, and electronic device
US8731926B2 (en) Spoken term detection apparatus, method, program, and storage medium
US20180358005A1 (en) System and method for implementing a vocal user interface by combining a speech to text system and a speech to intent system
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN111797632B (en) Information processing method and device and electronic equipment
US20140019131A1 (en) Method of recognizing speech and electronic device thereof
CN106875936B (en) Voice recognition method and device
KR20080018622A (en) Speech recognition system of mobile terminal
US20180301144A1 (en) Electronic device, method for adapting acoustic model thereof, and voice recognition system
US10971149B2 (en) Voice interaction system for interaction with a user by voice, voice interaction method, and program
CN112002349B (en) Voice endpoint detection method and device
KR20180012639A (en) Voice recognition method, voice recognition device, apparatus comprising Voice recognition device, storage medium storing a program for performing the Voice recognition method, and method for making transformation model
CN114385800A (en) Voice conversation method and device
US20240013784A1 (en) Speaker recognition adaptation
CN112863496B (en) Voice endpoint detection method and device
CN113113024A (en) Voice recognition method and device, electronic equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN117043856A (en) Efficient streaming non-recurrent on-device end-to-end model
CN108932943A (en) Order word sound detection method, device, equipment and storage medium
JP2023162265A (en) Text echo cancellation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant