CN113345473A - Voice endpoint detection method and device, electronic equipment and storage medium - Google Patents

Voice endpoint detection method and device, electronic equipment and storage medium

Info

Publication number
CN113345473A
Authority
CN
China
Prior art keywords
voice
segment
decoding
data stream
time
Prior art date
Legal status
Granted
Application number
CN202110703540.6A
Other languages
Chinese (zh)
Other versions
CN113345473B
Inventor
王庆然
万根顺
高建清
刘聪
王智国
胡国平
Current Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202110703540.6A
Publication of CN113345473A
Application granted
Publication of CN113345473B
Legal status: Active

Classifications

    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/26 Speech to text systems

Abstract

The invention provides a voice endpoint detection method and device, an electronic device and a storage medium. The method includes: acquiring a real-time transcription text of a voice data stream and a voice segment of the voice data stream; performing silence detection on the voice segment based on the semantic features of the real-time transcription text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment; and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment. The real-time transcription text supplies semantic features as a reference for silence detection while keeping the computational cost of endpoint detection low, which helps realize real-time, low-power voice endpoint detection. Because silence detection combines voice features with semantic features, the interference resistance of voice endpoint detection is greatly improved, speech fragments with no specific or relevant semantics are filtered out, and early interruption of the human-computer interaction process caused by false triggering is avoided.

Description

Voice endpoint detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a method and an apparatus for detecting a voice endpoint, an electronic device, and a storage medium.
Background
In order to implement voice-based human-computer interaction, the voice endpoints in a piece of speech are usually identified through Voice Activity Detection (VAD, also called voice endpoint detection) technology, so as to obtain valid speech segments for subsequent operations.
Compared with conventional VAD, the difficulty of VAD in a human-machine dialogue scenario is that not only must noise unrelated to human speech be filtered out more accurately, but reply content that carries no clear semantic information, or that is irrelevant to the current scenario, must also be filtered out according to the semantics of the user's answer, and no response should be made to such content.
Existing VAD technology can only distinguish human speech from non-speech and cannot analyze the semantic information carried by the speech. In complex scenarios it may misjudge environmental noise as normal human speech, so the resulting valid speech segment contains a large amount of meaningless content and the human-computer interaction process is interrupted early. In addition, feeding a large amount of meaningless content into subsequent speech processing increases the system's latency and power consumption unnecessarily and degrades the interaction experience.
Disclosure of Invention
The invention provides a voice endpoint detection method and device, an electronic device and a storage medium, which are used to solve the problem that, in the prior art, voice endpoint detection can only distinguish speech from non-speech, resulting in increased computation delay, increased power consumption and early interruption of the interaction.
The invention provides a voice endpoint detection method, which comprises the following steps:
acquiring a real-time transcription text of a voice data stream and a voice section of the voice data stream;
based on the semantic features of the real-time transcription text and the voice features of the voice sections, carrying out silence detection on the voice sections to obtain a silence detection sequence of the voice sections, wherein the silence detection sequence represents that a plurality of continuous sections in the voice sections are active voice or silence;
and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
According to a voice endpoint detection method provided by the present invention, the performing silence detection on the voice segment based on the semantic features of the real-time transcription text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment includes:
performing character decoding on the content characteristics of the voice segment, and determining a character decoding result as the silence detection sequence;
the content features are obtained by fusing the semantic features of the real-time transcription text and the voice features of the voice sections.
According to a voice endpoint detection method provided by the present invention, the character decoding of the content features of the voice segment includes:
based on the semantic features and the decoding state at the current decoding moment, performing attention conversion on the voice features to obtain voice context features at the current decoding moment;
determining the content characteristics of the current decoding moment based on the voice context characteristics of the current decoding moment;
performing character decoding based on the content characteristics of the current decoding moment to obtain a decoding result of the current decoding moment;
the decoding state of the current decoding moment is determined based on the decoding state of the last decoding moment and the decoding result, and the character decoding result is the decoding result of the final decoding moment.
According to a voice endpoint detection method provided by the present invention, the performing attention conversion on the voice feature based on the semantic feature and the decoding state at the current decoding time to obtain the voice context feature at the current decoding time includes:
determining attention weight of each frame feature in the voice features based on the semantic features and the decoding state at the current decoding moment;
and performing weighted fusion on each frame feature based on the attention weight of each frame feature to obtain the speech context feature at the current decoding moment.
According to a voice endpoint detection method provided by the present invention, the voice endpoint detection is performed on the voice data stream based on the silence detection sequence of the voice segment, including:
determining the time boundary of each segment in the voice segment based on the time length of the voice segment and the length of the silence detection sequence;
and performing voice endpoint detection on the voice data stream based on the silence detection sequence of each voice segment in the voice data stream and the time boundary of each segment.
According to the voice endpoint detection method provided by the invention, the acquiring of the real-time transcription text of the voice data stream comprises the following steps:
based on the audio energy of each voice frame in the voice data stream, carrying out mute segment filtering on the voice data stream;
and performing real-time transcription on the voice data stream after the silence segment is filtered to obtain the real-time transcription text.
According to the voice endpoint detection method provided by the invention, the starting point of the voice data stream is the tail end point of the last effective voice segment.
The present invention also provides a voice endpoint detection apparatus, comprising:
the data acquisition unit is used for acquiring a real-time transcription text of a voice data stream and a voice section of the voice data stream;
a silence detection unit, configured to perform silence detection on the voice segment based on the semantic features of the real-time transcribed text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment, where the silence detection sequence indicates that a plurality of continuous segments in the voice segment are active voices or silences;
and the endpoint detection unit is used for carrying out voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of any one of the voice endpoint detection methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the voice endpoint detection method as described in any of the above.
The voice endpoint detection method, the voice endpoint detection device, the electronic equipment and the storage medium provided by the invention have the advantages that the real-time transcription text of the voice data stream is obtained through real-time voice recognition, the semantic features are provided for silence detection as reference, the operation efficiency of voice endpoint detection is considered, and the realization of real-time and low-power-consumption voice endpoint detection is facilitated. The silence detection combines the voice characteristic and the semantic characteristic, so that the anti-interference capability of voice endpoint detection can be greatly improved, voice fragments without specific semantics or irrelevant semantics are filtered, and the problem of early interruption of a man-machine interaction process caused by false triggering is avoided. The silence detection sequence is used for integrally representing the silence detection result of each segment in the voice segment, compared with the silence detection at the voice frame level, the silence detection sequence can further deal with noise interference and ensure the reliability of voice endpoint detection.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a flow chart of a voice endpoint detection method provided by the present invention;
FIG. 2 is a flowchart illustrating step 120 of the voice endpoint detection method provided by the present invention;
FIG. 3 is a flowchart illustrating step 130 of the voice endpoint detection method provided by the present invention;
FIG. 4 is a flow chart illustrating the real-time speech recognition in step 110 of the method for detecting a speech endpoint according to the present invention;
FIG. 5 is a flow chart of a voice endpoint detection method provided by the present invention;
FIG. 6 is a schematic structural diagram of a voice endpoint detection apparatus provided in the present invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, human-computer interaction based on voice usually includes first voice segment detection and then semantic understanding, and specifically, the method can be divided into three steps, namely: detecting effective voice fragments of the user speaking, extracting semantic information of the user speaking content from the effective voice fragments, and editing answer content according to the semantic information.
The first step, detecting the valid speech segment of the user's utterance, is currently realized with general-purpose VAD technology, which screens out the part of the audio where the user is actually speaking and filters out noise such as environmental sounds. Each time the user speaks a sentence, the dialogue system extracts one valid speech segment. In addition, VAD also performs session segmentation: because the dialogue system itself cannot judge when the user has finished speaking and when the answering voice should be played, the interaction logic currently used in dialogue systems is that, once the tail endpoint of a valid speech segment is detected, the user is considered to have finished speaking and the system enters semantic understanding and the subsequent answering process.
However, because general-purpose VAD can only distinguish speech from non-speech and cannot analyze the semantic information carried by the speech, its resistance to environmental interference is weak. When there is environmental noise (such as the sound of knocking on a desk or electrical noise) or nearby people are speaking (side-channel speech), the VAD result may be abnormal for two reasons. First, environmental noise without human speech, or human-made noise (such as laughter or coughing), is judged to be normal speech content, so the interaction is interrupted early and a speech segment without actual content is returned. Second, long-pause meaningless speech is cut out and returned, such as strings of interjections, filler words, or content unrelated to the expected answer; such speech provides no useful semantic information to the dialogue system yet still interrupts the interaction early, making it hard for the system to obtain the user's real speech content. Because general-purpose VAD is so prone to these abnormalities, the interaction logic in the dialogue system is falsely triggered with high probability, making the system extremely unstable and the user experience very poor.
To reduce the probability of false triggering, semantic understanding technology could be introduced into the above interaction logic. However, introducing semantic understanding into the interaction logic increases the latency of the dialogue system, so the user may have to wait a long time after finishing speaking before getting a response, which conflicts with the real-time requirement of a dialogue system. How to improve VAD so that it is better suited to human-machine dialogue scenarios, avoiding early interruption of the interaction caused by false triggering while still guaranteeing real-time performance, therefore remains an urgent problem in the field of human-machine interaction.
Fig. 1 is a schematic flow diagram of a voice endpoint detection method provided by the present invention. As shown in fig. 1, the method can be applied to common speech recognition scenarios such as conference transcription and intelligent customer service, and also to dialogue scenarios that require real-time semantic understanding and have strict requirements on noise-induced false triggering. The method comprises the following steps:
step 110, obtaining a real-time transcription text of the voice data stream and a voice section of the voice data stream.
Here, the voice data stream is a data stream obtained by real-time recording, and the real-time recording may be voice recording or video recording, which is not specifically limited in this embodiment of the present invention.
When the voice data stream is recorded in real time, the voice recognition can be carried out on the recorded voice data stream in real time, so that the real-time transcription text of the voice data stream is obtained. The real-time transcription text directly reflects the user speaking content in the voice data stream, real-time voice recognition of the voice data stream is carried out at the same time of the recorded voice data stream, no additional processing time is occupied, and the method is efficient, simple and convenient.
The voice segment in the voice data stream is a piece of data cut from the real-time recorded voice data stream, and its duration is known. The durations of the voice segments cut during operation of the voice endpoint detection method may all be the same or may differ. For example, the duration of a voice segment may be preset, and during real-time recording the voice data stream is cut once every preset duration, so as to obtain the most recently recorded voice segment of that preset duration.
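As an illustration only (not part of the patent text), the following Python sketch shows one way such fixed-duration segments could be cut from an accumulating real-time sample buffer; the frame length and segment duration used here are assumed values.

```python
import numpy as np

FRAME_SAMPLES = 160      # assumed: 10 ms frames at 16 kHz
SEGMENT_FRAMES = 40      # assumed: one speech segment = 40 frames (the preset duration)

def cut_segments(sample_buffer):
    """Yield fixed-duration speech segments from the accumulating real-time sample buffer."""
    seg_len = FRAME_SAMPLES * SEGMENT_FRAMES
    for start in range(0, len(sample_buffer) - seg_len + 1, seg_len):
        yield np.asarray(sample_buffer[start:start + seg_len])
```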
And step 120, based on the semantic features of the real-time transcribed text and the voice features of the voice segments, performing silence detection on the voice segments to obtain a silence detection sequence of the voice segments, wherein the silence detection sequence indicates that a plurality of continuous segments in the voice segments are active voice or silence.
Specifically, the real-time transcription text is derived from a real-time recorded voice data stream, and the voice segment is also derived from the real-time recorded voice data stream, so that the user speech content contained in the real-time transcription text necessarily covers the user speech content in a segment of the voice segment in the voice data stream.
Unlike general-purpose VAD, which considers only voice features when performing silence detection on a voice segment, the embodiment of the present invention considers not only the voice features of the segment but also the semantic features of the real-time transcribed text, which covers the user's speech content in that segment. With this silence detection approach combining voice and semantic features, the judgment of whether each moment in a voice segment is silence or active speech does not rely solely on acoustic information reflected by the voice features, such as sound intensity, loudness and pitch; it also refers to semantic information reflected by the semantic features, such as whether any semantic content is present and whether that content is related to the topic of the conversation.
The resulting silence detection result is represented in the form of a silence detection sequence that naturally divides the speech segment into several consecutive segments and sequentially identifies each segment as active speech or silence. It should be noted that, for the case of dividing a speech segment into several segments, in the embodiment of the present invention, the duration of each segment divided in a single speech segment is equal by default.
Furthermore, the silence detection combining the semantic features and the voice features can be realized through a pre-trained neural network model, for example, the semantic features and the voice features can be input into the pre-trained neural network model together for silence detection, or the semantic features and the voice features can be fused at present, and then the fused features are input into the pre-trained neural network model for silence detection. The neural network model for silence detection may be a structure of an encoder and a decoder, the encoder performs encoding fusion of semantic features and voice features, the decoder decodes the fused features to output a silence detection sequence, and the neural network model may also be a decoder, and decodes while fusing the features in a decoding process, which is not specifically limited in the embodiment of the present invention.
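The patent does not give the model in code; purely as a hedged sketch of the encoder + decoder arrangement described above, the following PyTorch module encodes the speech features, fuses them with a semantic feature, and emits one token per decoding step. All layer sizes, the mean-pooling fusion, and the module and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SilenceDetector(nn.Module):
    """Encoder-decoder silence detection combining speech and semantic features (sketch)."""

    def __init__(self, speech_dim=80, sem_dim=256, hid=256, vocab=3):
        super().__init__()
        self.hid = hid
        # encoder: turns frame-level speech features into high-dimensional vectors
        self.encoder = nn.LSTM(speech_dim, hid, batch_first=True)
        # decoder: emits one "voice" / "silence" / end token per step
        self.decoder_cell = nn.LSTMCell(hid + sem_dim, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, speech_feats, sem_feat, max_steps=8):
        # speech_feats: (B, T, speech_dim); sem_feat: (B, sem_dim)
        enc, _ = self.encoder(speech_feats)              # (B, T, hid)
        h = enc.new_zeros(enc.size(0), self.hid)
        c = torch.zeros_like(h)
        logits = []
        for _ in range(max_steps):
            # crude fusion: mean-pooled speech encoding concatenated with the semantic feature
            fused = torch.cat([enc.mean(dim=1), sem_feat], dim=-1)
            h, c = self.decoder_cell(fused, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                # (B, max_steps, vocab)
```

In practice the fusion would use the attention mechanism described in the later embodiments rather than mean pooling.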
Step 130, performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
Specifically, since the duration of the speech segment itself is known, after the silence detection sequence of the speech segment is obtained, the duration of each segment in the speech segment can be derived, and thus the duration of active speech or silence within the speech segment can be determined. Using the speech segment in this way compensates for the fact that the silence detection output sequence by itself cannot express an exact time boundary, so the silence detection sequence can be aligned with the time axis.
On the basis, the voice endpoint detection of the voice data stream can be realized by combining the duration of the active voice or the silence of the continuous voice segments in the voice data stream, so as to determine the head end point and the tail end point of the effective voice segment possibly contained in the voice data stream, and output the effective voice segment for the subsequent conversation.
The method provided by the embodiment of the invention obtains the real-time transcription text of the voice data stream through real-time voice recognition, provides semantic features for silence detection as reference, simultaneously considers the operation efficiency of voice endpoint detection, and is beneficial to realizing the real-time low-power-consumption voice endpoint detection. The silence detection combines the voice characteristic and the semantic characteristic, so that the anti-interference capability of voice endpoint detection can be greatly improved, voice fragments without specific semantics or irrelevant semantics are filtered, and the problem of early interruption of a man-machine interaction process caused by false triggering is avoided. The silence detection sequence is used for integrally representing the silence detection result of each segment in the voice segment, compared with the silence detection at the voice frame level, the silence detection sequence can further deal with noise interference and ensure the reliability of voice endpoint detection.
Based on the above embodiment, step 120 includes:
performing character decoding on the content features of the voice segment, and determining the character decoding result of the voice segment as the silence detection sequence; the content features are obtained by fusing the semantic features of the real-time transcription text and the voice features of the voice segment.
Specifically, since the speech segment is ordered in time, silence detection is also a sequential output process. In the embodiment of the invention, silence detection on a voice segment can be performed by character decoding of content features that fuse the semantic features of the real-time transcribed text with the voice features of the voice segment. The character decoding here can be implemented with reference to the decoders used in general text generation tasks, such as text translation and abstract generation, all of which decode characters from encoded features to generate a target text; for example, character decoding of the content features can be implemented by the decoder in an encoder + decoder structure.
The content feature used for character decoding may be obtained by performing feature encoding and fusion on the semantic feature of the real-time transcription text and the voice feature of the voice segment before character decoding, for example, the semantic feature and the voice feature may be directly added as the content feature, or the semantic feature and the voice feature may be spliced as the content feature; or in the character decoding process, the semantic features are fused into an attention mechanism in the decoding process for addition, so that the voice features required by current decoding are fused with the semantic features of the real-time transcription text in the decoding process of each character, and the content features required by the current decoding obtained by fusion are decoded.
In the character decoding result, each character corresponds to one segment of the voice segment and indicates whether that segment is active speech or silence. For example, the character decoding result of a speech segment may be "voice silence voice", which means that the speech segment is divided into three segments, where the first segment is active speech, the second is silence, and the third is active speech.
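Purely as an illustration (the token names and the even-split rule, taken from the default stated above, are assumptions), the sketch below maps such a decoding result back onto equal-length sub-segments of one speech segment:

```python
def decode_result_to_segments(tokens, segment_frames=40):
    """Split one speech segment evenly according to its decoded tokens, e.g. ["voice", "silence", "voice"]."""
    per = segment_frames / len(tokens)
    return [(round(i * per), round((i + 1) * per), tok) for i, tok in enumerate(tokens)]

# decode_result_to_segments(["voice", "silence", "voice"])
# -> [(0, 13, 'voice'), (13, 27, 'silence'), (27, 40, 'voice')]
```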
Based on any of the above embodiments, fig. 2 is a schematic flowchart of step 120 in the voice endpoint detection method provided by the present invention, as shown in fig. 2, step 120 includes:
step 121, based on the semantic features and the decoding state at the current decoding time, performing attention conversion on the voice features to obtain voice context features at the current decoding time;
step 122, determining the content characteristics of the current decoding moment based on the voice context characteristics of the current decoding moment;
step 123, performing character decoding based on the content characteristics of the current decoding time to obtain a decoding result of the current decoding time;
the decoding state of the current decoding moment is determined based on the decoding state of the previous decoding moment and the decoding result, and the character decoding result is the decoding result of the final decoding moment.
Specifically, considering that a speech segment is a segment of speech in a speech data stream, and a real-time transcription text covers the whole semantics of the speech data stream, the semantic features of the real-time transcription text not only reflect the semantic information contained in the speech segment, but also reflect the semantic information contained in the speech data preceding the speech segment in the speech data stream. If only the semantic features of the real-time transcribed text and the speech features of the speech segment are added or spliced, the semantic information contained in the speech segment and the semantic information contained in the speech data before the speech segment cannot be distinguished, and the content features obtained thereby are not reasonable. Therefore, in the embodiment of the invention, in the character decoding process, the semantic features are fused into the attention mechanism in the decoding process for addition, so that the voice features required by current decoding are fused with the semantic features of the real-time transcribed text in the decoding process of each character, and the representation capability of the content features obtained by fusion on two levels of voice and semantic is improved.
Further, in the process of character decoding, for the current decoding time, attention conversion may be performed on the speech feature of the speech segment based on the decoding state of the current decoding time in combination with the semantic feature of the real-time transcribed text, so as to obtain the speech feature, i.e., the speech context feature, after attention adjustment is performed on the current decoding time in combination with the semantic information and the decoding state. Wherein, the decoding state at any decoding time comprises historical information generated in the decoding process before the decoding time.
Here, each frame feature contributing to the speech context feature is strengthened or weakened based on the semantic information and the historical state of the decoding process: in the adjusted speech context feature, voice features corresponding to semantic information relevant to the conversation are enhanced, while voice features corresponding to irrelevant semantic information, or to audio with no specific semantic content, are weakened. As a result, when the speech context feature is later used for character decoding, active speech and silence are easier to tell apart, which improves the accuracy and reliability of character decoding.
After obtaining the speech context feature at the current decoding time, the speech context feature may be directly used as the content feature at the current decoding time for decoding, or the speech context feature may be fused with the speech feature to be decoded at the current decoding time determined based on the previous decoding state and the decoding result, and the content feature may be used for decoding.
After the content feature of the current decoding time is obtained, character decoding can be performed based on it; for example, the content feature can be combined with the decoding state of the current decoding time and the decoding result of the previous decoding time, so as to improve the reliability of character decoding. When character decoding at the current decoding time is finished, the character output at the current decoding time is obtained and appended to the decoding result of the previous decoding time to form the decoding result of the current decoding time. For example, if the decoding result at the previous decoding time is "voice silence" and the character decoded at the current decoding time is "voice", then "voice" is appended after "voice silence" to obtain the decoding result "voice silence voice".
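To make the per-step control flow of steps 121-123 concrete, here is a hedged Python sketch of the loop described above; attend(), fuse(), decode_char(), update_state() and the end marker stand in for model components and are assumptions, not names from the patent.

```python
def character_decode(speech_feats, semantic_feats, model, max_steps=16):
    """Autoregressive character decoding of one speech segment (illustrative sketch)."""
    state = model.init_state()
    result = []                                    # decoding result so far, e.g. ["voice", "silence"]
    prev_char = None
    for _ in range(max_steps):
        # step 121: attention over the speech features given semantics and the current decoding state
        context = model.attend(speech_feats, semantic_feats, state)
        # step 122: content feature of the current decoding time
        content = model.fuse(context, state)
        # step 123: decode one character ("voice" / "silence" / an end marker)
        char = model.decode_char(content, state, prev_char)
        if char == "<end>":
            break
        result.append(char)
        prev_char = char
        # the decoding state of the next time depends on the current state and decoding result
        state = model.update_state(state, char)
    return result                                  # the final decoding result is the silence detection sequence
```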
The method provided by the embodiment of the invention fuses semantic information in the decoding process, and improves the accuracy of silence detection, thereby improving the accuracy of voice endpoint detection.
Based on any of the above embodiments, step 121 includes:
determining attention weight of each frame feature in the voice features based on the semantic features and the decoding state at the current decoding moment;
and performing weighted fusion on each frame feature based on the attention weight of each frame feature to obtain the speech context feature at the current decoding moment.
Specifically, for the current decoding time, attention interaction may be performed on the semantic features of the real-time transcribed text and the decoding state at the current decoding time through an attention mechanism, so as to obtain an attention weight of the frame feature of each frame of speech in the speech features, which may be expressed as follows, for example:
α'_t = softmax(v^T * tanh(q_t + K))
where t denotes the current decoding time, α'_t is the attention weight of each frame feature in the speech features at the current decoding time, q_t is the decoding state at the current decoding time, K is the semantic feature, and v^T is a preset weight matrix.
After obtaining the attention weight of each frame feature in the speech features, performing weighted fusion on each frame feature through the attention weight, so as to adjust the strength of each frame feature in the speech features, thereby obtaining the speech context feature at the current decoding time, which may be represented as follows, for example:
c'_t = Σ α'_t * h_t
where c'_t is the speech context feature at the current decoding time and h_t is each frame feature in the speech features at the current decoding time.
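Read literally, the two formulas above can be sketched in NumPy as follows; the shapes chosen here (per-frame keys K that already carry the semantic features, a single weight vector v) are assumptions made only to give the formulas a runnable form.

```python
import numpy as np

def speech_context(H, K, q_t, v):
    """One decoding step: attention weights and speech context feature (sketch of the formulas above).

    H   : (T, d) frame features h_t of the speech segment
    K   : (T, d) per-frame keys carrying the semantic features (assumed shape)
    q_t : (d,)   decoding state at the current decoding time
    v   : (d,)   weight vector (v^T in the formula)
    """
    scores = np.tanh(q_t + K) @ v                # v^T * tanh(q_t + K), one score per frame
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                  # softmax -> attention weights alpha'_t
    c_prime = alpha @ H                          # weighted fusion: sum over frames of alpha'_t * h_t
    return alpha, c_prime
```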
Based on any of the above embodiments, step 122 includes:
and determining the content characteristic of the current decoding moment by combining the speech context characteristic of the current decoding moment and the speech decoding characteristic of the current decoding moment.
Here, the speech decoding feature at the current decoding time may be a speech feature to be decoded at the current decoding time in a case where character decoding is performed while ignoring the semantic feature. The speech decoding characteristic at the current decoding time may be obtained by adjusting the speech characteristic of the speech segment based on the decoding state and the decoding result at the previous decoding time.
For example, the speech context feature at the current decoding time and the speech decoding feature at the current decoding time may be spliced to serve as the content feature at the current decoding time; or the two may be added to serve as the content feature; or a further feature may be extracted from their concatenation to serve as the content feature at the current decoding time.
The content feature obtained by the addition may be expressed as follows:
C_new = c_t + c'_t
where C_new is the content feature at the current decoding time and c_t is the speech decoding feature at the current decoding time.
Based on any of the above embodiments, fig. 3 is a schematic flowchart of step 130 in the voice endpoint detection method provided by the present invention, as shown in fig. 3, step 130 includes:
step 131, determining the time boundary of each segment in the voice segment based on the duration of the voice segment and the length of the silence detection sequence.
Specifically, it is considered that the silence detection sequence obtained in step 120 only represents that each segment in the speech segment is active speech or silence, and cannot represent the corresponding position of each segment in the speech segment on the time axis, so that the time boundary of each segment in the speech segment needs to be solved.
For dividing a speech segment into several segments, the embodiment of the present invention defaults to the fact that the duration of each segment divided in a single speech segment is equal. Since the time length of the voice segment is known, after the silence detection sequence of the voice segment is obtained, the time length of each segment can be determined based on the time length of the voice segment and the length of the silence detection sequence, and then the time boundary of each segment is determined according to the position of each segment in the voice segment.
For example, assume the duration of each speech segment is 40 frames. If the first speech segment is decoded into 4 tokens, the duration of each segment in the first speech segment is 40/4 = 10 frames, and the time boundaries of the 4 segments are: 0-10 frames, 10-20 frames, 20-30 frames and 30-40 frames. If the second speech segment is decoded into only 2 tokens, the duration of each segment in the second speech segment is 40/2 = 20 frames, and the time boundaries of the two segments are: 40-60 frames and 60-80 frames.
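A small sketch of this boundary computation (the 40-frame segment duration and the token names come from the example above; the function name is an assumption):

```python
def segment_time_boundaries(decoded_segments, segment_frames=40):
    """Assign absolute frame boundaries to tokens decoded from consecutive speech segments.

    decoded_segments: one token list per speech segment,
                      e.g. [["voice", "silence", "voice", "voice"], ["voice", "voice"]]
    """
    boundaries = []
    offset = 0
    for tokens in decoded_segments:
        per = segment_frames // len(tokens)        # equal-duration sub-segments by default
        for i, tok in enumerate(tokens):
            boundaries.append((offset + i * per, offset + (i + 1) * per, tok))
        offset += segment_frames
    return boundaries

# segment_time_boundaries([["voice", "silence", "voice", "voice"], ["voice", "voice"]])
# -> [(0, 10, 'voice'), (10, 20, 'silence'), (20, 30, 'voice'), (30, 40, 'voice'),
#     (40, 60, 'voice'), (60, 80, 'voice')]
```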
Step 132, performing voice endpoint detection on the voice data stream based on the silence detection sequence of each voice segment in the voice data stream and the time boundary of each segment.
Specifically, after the time boundary of each segment in each speech segment is obtained, the speech endpoint detection may be performed based on whether the corresponding segment represented by the silence detection sequence of each speech segment is active speech or silence and the time boundary of each segment in each speech segment. Here, the voice endpoint detection may be implemented based on a preset detection rule of the head endpoint and the tail endpoint, and the specific detection rule may be adjusted according to a specific application scenario of the voice endpoint detection, which is not specifically limited in the embodiment of the present invention.
Based on any of the above embodiments, performing voice endpoint detection on the voice data stream in step 132 may be specifically divided into performing head endpoint detection and tail endpoint detection:
the adjacent and same type of segments can be integrated before detection, and the time boundary of each segment and the type of each segment in the voice data stream are obtained after integration. The type of segment here is active speech or silence.
For head endpoint detection, active speech that lasts for a continuous starting duration is determined to be the beginning of valid speech, thereby locating the head endpoint. The starting duration may be preset, for example to 20 frames or 15 frames. Taking a starting duration of 20 frames as an example, assume frames 0-10 are active speech and frames 10-20 are silence; since the active run of frames 0-10 is shorter than 20 frames, frames 0-20 are treated as non-speech. From frame 20 onward there are 60 consecutive frames of active speech, which exceeds 20 frames, so by frame 40 it can be judged that the head endpoint of a valid speech segment has been detected at frame 20, and the segment from frame 20 to frame 80 is the valid speech segment.
In addition, some auxiliary measures can be applied to the detection of valid speech segments. For example, a 30-frame silence protection policy can be set: after the head endpoint has been detected (e.g., 20 frames of active speech), if the duration of an adjoining silence segment does not exceed 30 frames, the valid speech segment is considered to be still continuing, which ensures that no false triggering occurs when the user pauses after one or two words and then continues speaking.
For tail endpoint detection, once a head endpoint has been detected, if silence lasting for a continuous terminating duration is detected, the active speech is judged to be over, thereby locating the tail endpoint. The terminating duration may be preset, for example to 30 frames or 40 frames. Taking a terminating duration of 30 frames as an example, assume there are four consecutive 10-frame silence segments starting from frame 80, 40 frames in total; the tail endpoint can then be determined within the third silence segment, and frames 80-120 are determined to be silence.
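A hedged sketch of these head/tail rules over already-merged (start, end, label) segments; the 20-frame and 30-frame durations come from the examples above, and the silence protection policy is omitted for brevity.

```python
def detect_endpoints(labeled, start_frames=20, end_frames=30):
    """Locate head/tail endpoints from merged (start, end, label) segments (illustrative sketch)."""
    endpoints = []
    head = None
    for start, end, label in labeled:
        if head is None:
            # head endpoint: active speech lasting at least start_frames
            if label == "voice" and end - start >= start_frames:
                head = start
        else:
            # tail endpoint: silence lasting at least end_frames after the head endpoint
            if label == "silence" and end - start >= end_frames:
                endpoints.append((head, start))
                head = None
    return endpoints

# detect_endpoints([(0, 10, "voice"), (10, 20, "silence"), (20, 80, "voice"), (80, 120, "silence")])
# -> [(20, 80)]
```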
Based on any of the above embodiments, fig. 4 is a schematic flowchart of the real-time speech recognition in step 110 of the speech endpoint detection method provided by the present invention, and as shown in fig. 4, in step 110, acquiring a real-time transcription text of a speech data stream includes:
step 111, based on the audio energy of each voice frame in the voice data stream, performing silent segment filtering on the voice data stream;
and step 112, performing real-time transcription on the voice data stream after the silence segment is filtered to obtain a real-time transcription text.
Specifically, since speech recognition on a real-time recorded voice data stream consumes considerable computing resources, a filtering step may be added before speech recognition. The audio energy of each speech frame in the real-time recorded voice data stream is obtained, and the magnitude of the audio energy is used to judge whether each speech frame is likely to belong to a silence segment; speech frames likely to belong to silence segments are then filtered out, and only the remaining speech frames are transcribed in real time. This reduces the amount of data to be transcribed and thereby the computing resources required for real-time transcription. For example, an energy threshold and a preset number of frames may be set in advance: if the number of consecutive speech frames whose audio energy is below the energy threshold exceeds the preset number of frames, the span containing those frames is judged to be a silence segment and is filtered out.
The voice data stream is then transcribed in real time: the acoustic features of each speech frame are extracted and decoded by a real-time acoustic model to obtain the corresponding decoded text, which serves as the real-time transcription text. During transcription, and considering the need for semantic continuity of the text, decoding audio from an overly short sliding window may lose much of the semantic information; preferably, therefore, the acoustic features are extracted cumulatively rather than over a sliding window. For example, the acoustic feature of each speech frame may be a Filter Bank or MFCC (Mel-Frequency Cepstral Coefficients) feature.
The method provided by the embodiment of the invention can be used for filtering the mute segments through the audio energy, thereby saving a large amount of voice decoding calculation amount.
Based on any of the above embodiments, step 111 may be implemented by:
Two energy thresholds are preset: a lower energy threshold P_low and a higher energy threshold P_high.
When currently in a silence segment, or when silence detection has just started:
if the audio energy P of the current speech frame satisfies P < P_low, stay in (or jump directly to) the silence segment;
if P_low ≤ P < P_high, jump to a transition segment.
When currently in the transition segment:
if the audio energy P of the current speech frame falls back below P_low, jump back to a silence segment;
if P ≥ P_high, jump to a speech segment and mark the start of the speech segment.
When currently in a speech segment:
if the audio energy P of the current speech frame stays below P_low for more than M consecutive frames, jump to a silence segment and end the speech segment;
if the audio energy falls below P_low for no more than M consecutive frames, the speech segment is maintained and monitoring continues.
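A minimal sketch of this two-threshold state machine, assuming illustrative values for P_low, P_high and M (none of which are fixed by the text above):

```python
def energy_state_machine(frame_energies, p_low=0.01, p_high=0.05, m=30):
    """Silence / transition / speech state machine over per-frame energies (sketch).

    Returns (start_frame, end_frame) pairs for the detected speech segments.
    """
    state, speech_start, low_run = "silence", None, 0
    segments = []
    for i, p in enumerate(frame_energies):
        if state == "silence":
            if p >= p_high:
                state, speech_start = "speech", i   # assumed: jump straight to speech on a high-energy frame
            elif p >= p_low:
                state = "transition"
        elif state == "transition":
            if p >= p_high:
                state, speech_start = "speech", i
            elif p < p_low:
                state = "silence"
        else:  # speech
            if p < p_low:
                low_run += 1
                if low_run > m:                     # energy stayed below p_low for more than m frames
                    segments.append((speech_start, i - low_run + 1))  # speech ended where the low run began
                    state, low_run = "silence", 0
            else:
                low_run = 0
    if state == "speech":
        segments.append((speech_start, len(frame_energies)))
    return segments
```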
According to any of the above embodiments, the starting point of the voice data stream is the end point of the last valid voice segment.
In particular, the real-time recording of the voice data stream is not always continuous without interruption. On the contrary, considering that the speech contents represented between the valid voice segments are relatively independent, in the voice endpoint detection process, after an end point is detected, a valid voice segment is considered to be ended, and the contents contained in the voice data stream recorded later are irrelevant to the contents contained in the valid voice segment recorded before and ended, so that the end point of the last valid voice segment is used as the starting point of restarting the recorded voice data stream.
The resulting speech data stream does not contain the content of the previously recorded valid speech segments, and therefore the semantic features referred to in the silence detection of the speech segments do not relate to the content of the previously recorded valid speech segments. By using the end point of the last valid voice segment as the starting point of the voice data stream for restarting recording, the contents irrelevant to the current voice data stream are filtered, which is helpful for improving the reliability of silence detection.
Based on any of the above embodiments, fig. 5 is a schematic flow chart of the voice endpoint detection method provided by the present invention, and as shown in fig. 5, for a real-time recorded voice data stream, two branches may be divided to be processed respectively.
One path of the audio is accumulated after the last effective voice segment and is sent to decoding, in other words, the tail end point of the last effective voice segment is used as the starting point of the voice data stream, and the voice data stream recorded in real time is sent to decoding, so that real-time voice transcription is realized. In the process, the acoustic features of each speech frame in the speech data stream can be extracted first, then, based on the audio energy of each speech frame, the speech data stream is subjected to silence segment filtering, and the acoustic features of each speech frame in the speech data stream after the silence segment filtering are input into a real-time acoustic model for decoding, so as to obtain a real-time transcription text of the speech data stream.
On the basis, the semantic extraction can be carried out on the real-time transcription text, so that the semantic features of the real-time transcription text are obtained, namely the real-time transcription text is converted into high-dimensional vector expression. The specific extraction method can map each word in the real-time transcription text into an embedding vector, and then extract the hidden vector of each word through structures such as a long-short-time memory network, a recurrent neural network and the like and splice to obtain semantic features.
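As a hedged illustration of this semantic extraction step (the vocabulary size, embedding and hidden dimensions, and class name are assumptions), a PyTorch sketch:

```python
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Map the real-time transcription text to semantic features (illustrative sketch)."""

    def __init__(self, vocab_size=8000, emb_dim=128, hid=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)    # word -> embedding vector
        self.lstm = nn.LSTM(emb_dim, hid, batch_first=True)   # hidden vector per word

    def forward(self, token_ids):                             # token_ids: (B, S)
        emb = self.embedding(token_ids)                       # (B, S, emb_dim)
        hidden, _ = self.lstm(emb)                            # (B, S, hid)
        return hidden                                         # per-word hidden vectors, spliced semantic features
```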
The other path accumulates a speech segment and then decodes it; for example, as shown in fig. 5, a speech segment is cut every 40 accumulated frames, and silence detection is then performed in combination with the semantic features. In this process, the voice features of the speech segment may be extracted first and further encoded, for example by an encode part, to obtain a high-dimensional feature vector (the encode vector). The encode part may be a long short-term memory network, a recurrent neural network, or a similar structure. After that, in the decode part, silence detection is performed on the speech segment by combining the semantic features with the encode vector obtained from the voice features, so as to output the silence detection sequence of the speech segment.
After the silence detection sequence is obtained, the time boundary of each segment in the speech segment can be determined by combining the duration of the speech segment with the length of the silence detection sequence, so that voice endpoint detection can be performed. For example, fig. 5 shows the concatenated silence detection sequences of three speech segments. The first speech segment is decoded into 4 tokens, so the duration of each segment in the first speech segment is 40/4 = 10 frames, and the time boundaries of the 4 segments are: 0-10 frames, 10-20 frames, 20-30 frames and 30-40 frames. The second speech segment is decoded into 2 tokens, so the duration of each segment in the second speech segment is 40/2 = 20 frames, and the time boundaries of the two segments are: 40-60 frames and 60-80 frames. The third speech segment is decoded into 4 tokens, so the duration of each segment in the third speech segment is 40/4 = 10 frames, and the time boundaries of the 4 segments are: 80-90 frames, 90-100 frames, 100-110 frames and 110-120 frames. In fig. 5, the squares filled with diagonal lines represent "voice" and the unfilled squares represent "silence". On this basis, the voice endpoints in the voice data stream can be detected by combining a preset endpoint detection rule.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a voice endpoint detection apparatus provided by the present invention, as shown in fig. 6, the apparatus includes:
a data obtaining unit 610, configured to obtain a real-time transcription text of a voice data stream and a voice segment of the voice data stream;
a silence detection unit 620, configured to perform silence detection on the voice segment based on the semantic features of the real-time transcribed text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment, where the silence detection sequence indicates that a plurality of continuous segments in the voice segment are active voices or silences;
an endpoint detection unit 630, configured to perform voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
The device provided by the embodiment of the invention acquires the real-time transcription text of the voice data stream through real-time voice recognition, provides semantic features for silence detection as reference, simultaneously considers the operation efficiency of voice endpoint detection, and is beneficial to realizing the real-time and low-power-consumption voice endpoint detection. The silence detection combines the voice characteristic and the semantic characteristic, so that the anti-interference capability of voice endpoint detection can be greatly improved, voice fragments without specific semantics or irrelevant semantics are filtered, and the problem of early interruption of a man-machine interaction process caused by false triggering is avoided. The silence detection sequence is used for integrally representing the silence detection result of each segment in the voice segment, compared with the silence detection at the voice frame level, the silence detection sequence can further deal with noise interference and ensure the reliability of voice endpoint detection.
Based on any of the above embodiments, the silence detection unit 620 is configured to:
performing character decoding on the content characteristics of the voice segment, and determining a character decoding result as the silence detection sequence;
the content features are obtained by fusing the semantic features of the real-time transcription text and the voice features of the voice sections.
Based on any of the above embodiments, the silence detection unit 620 is configured to:
based on the semantic features and the decoding state at the current decoding moment, performing attention conversion on the voice features to obtain voice context features at the current decoding moment;
determining the content characteristics of the current decoding moment based on the voice context characteristics of the current decoding moment;
performing character decoding based on the content characteristics of the current decoding moment to obtain a decoding result of the current decoding moment;
the decoding state of the current decoding moment is determined based on the decoding state of the last decoding moment and the decoding result, and the character decoding result is the decoding result of the final decoding moment.
Based on any of the above embodiments, the silence detection unit 620 is configured to:
determining attention weight of each frame feature in the voice features based on the semantic features and the decoding state at the current decoding moment;
and performing weighted fusion on each frame feature based on the attention weight of each frame feature to obtain the speech context feature at the current decoding moment.
Based on any of the above embodiments, the endpoint detection unit 630 is configured to:
determining the time boundary of each segment in the voice segment based on the time length of the voice segment and the length of the silence detection sequence;
and performing voice endpoint detection on the voice data stream based on the silence detection sequence of each voice segment in the voice data stream and the time boundary of each segment.
Based on any of the above embodiments, the data obtaining unit 610 is configured to:
based on the audio energy of each voice frame in the voice data stream, carrying out mute segment filtering on the voice data stream;
and performing real-time transcription on the voice data stream after the silence segment is filtered to obtain the real-time transcription text.
According to any of the above embodiments, the starting point of the voice data stream is the end point of the last valid voice segment.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a voice endpoint detection method comprising: acquiring a real-time transcription text of a voice data stream and a voice section of the voice data stream; based on the semantic features of the real-time transcription text and the voice features of the voice sections, carrying out silence detection on the voice sections to obtain a silence detection sequence of the voice sections, wherein the silence detection sequence represents that a plurality of continuous sections in the voice sections are active voice or silence; and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
In addition, the logic instructions in the memory 730 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the voice endpoint detection method provided above, the method comprising: acquiring a real-time transcription text of a voice data stream and a voice segment of the voice data stream; performing silence detection on the voice segment based on the semantic features of the real-time transcription text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment, wherein the silence detection sequence indicates whether each of a plurality of consecutive segments in the voice segment is active voice or silence; and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the voice endpoint detection method provided above, the method comprising: acquiring a real-time transcription text of a voice data stream and a voice segment of the voice data stream; performing silence detection on the voice segment based on the semantic features of the real-time transcription text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment, wherein the silence detection sequence indicates whether each of a plurality of consecutive segments in the voice segment is active voice or silence; and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for voice endpoint detection, comprising:
acquiring a real-time transcription text of a voice data stream and a voice segment of the voice data stream;
performing silence detection on the voice segment based on the semantic features of the real-time transcription text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment, wherein the silence detection sequence indicates whether each of a plurality of consecutive segments in the voice segment is active voice or silence;
and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
2. The method according to claim 1, wherein said performing silence detection on the voice segment based on the semantic features of the real-time transcription text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment comprises:
performing character decoding on the content features of the voice segment, and taking the character decoding result as the silence detection sequence;
the content features are obtained by fusing the semantic features of the real-time transcription text with the voice features of the voice segment.
3. The method according to claim 2, wherein said performing character decoding on the content features of the voice segment comprises:
based on the semantic features and the decoding state at the current decoding moment, performing attention transformation on the voice features to obtain the voice context feature at the current decoding moment;
determining the content feature at the current decoding moment based on the voice context feature at the current decoding moment;
performing character decoding based on the content feature at the current decoding moment to obtain the decoding result at the current decoding moment;
wherein the decoding state at the current decoding moment is determined based on the decoding state and the decoding result at the previous decoding moment, and the character decoding result is the decoding result at the final decoding moment.
4. The method according to claim 3, wherein the performing attention transformation on the voice features based on the semantic features and the decoding state at the current decoding moment to obtain the voice context feature at the current decoding moment comprises:
determining the attention weight of each frame feature in the voice features based on the semantic features and the decoding state at the current decoding moment;
and performing weighted fusion of the frame features based on their attention weights to obtain the voice context feature at the current decoding moment.
5. The method according to claim 1, wherein the performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segments comprises:
determining the time boundary of each segment within the voice segment based on the duration of the voice segment and the length of the silence detection sequence;
and performing voice endpoint detection on the voice data stream based on the silence detection sequence of each voice segment in the voice data stream and the time boundary of each segment.
6. The method of claim 1, wherein the obtaining real-time transcription text of the voice data stream comprises:
performing silence segment filtering on the voice data stream based on the audio energy of each voice frame in the voice data stream;
and performing real-time transcription on the voice data stream after silence segment filtering to obtain the real-time transcription text.
7. The method according to any one of claims 1 to 6, wherein the starting point of the voice data stream is the end point of the previous valid voice segment.
8. A voice endpoint detection apparatus, comprising:
the data acquisition unit is used for acquiring a real-time transcription text of a voice data stream and a voice segment of the voice data stream;
a silence detection unit, configured to perform silence detection on the voice segment based on the semantic features of the real-time transcription text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment, wherein the silence detection sequence indicates whether each of a plurality of consecutive segments in the voice segment is active voice or silence;
and the endpoint detection unit is used for carrying out voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the voice endpoint detection method according to any of claims 1 to 7.
10. A non-transitory computer readable storage medium, having a computer program stored thereon, wherein the computer program, when being executed by a processor, implements the steps of the voice endpoint detection method according to any one of claims 1 to 7.
CN202110703540.6A 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium Active CN113345473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703540.6A CN113345473B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110703540.6A CN113345473B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113345473A true CN113345473A (en) 2021-09-03
CN113345473B CN113345473B (en) 2024-02-13

Family

ID=77478230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703540.6A Active CN113345473B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113345473B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040030544A1 (en) * 2002-08-09 2004-02-12 Motorola, Inc. Distributed speech recognition with back-end voice activity detection apparatus and method
US20120221330A1 (en) * 2011-02-25 2012-08-30 Microsoft Corporation Leveraging speech recognizer feedback for voice activity detection
US20150325256A1 (en) * 2012-12-27 2015-11-12 Huawei Technologies Co., Ltd. Method and apparatus for detecting voice signal
US20160358598A1 (en) * 2015-06-07 2016-12-08 Apple Inc. Context-based endpoint detection
CN107919130A (en) * 2017-11-06 2018-04-17 百度在线网络技术(北京)有限公司 Method of speech processing and device based on high in the clouds
CN110875059A (en) * 2018-08-31 2020-03-10 深圳市优必选科技有限公司 Method and device for judging reception end and storage device
CN111627423A (en) * 2019-02-27 2020-09-04 百度在线网络技术(北京)有限公司 VAD tail point detection method, device, server and computer readable medium
CN112825248A (en) * 2019-11-19 2021-05-21 阿里巴巴集团控股有限公司 Voice processing method, model training method, interface display method and equipment
CN112863496A (en) * 2019-11-27 2021-05-28 阿里巴巴集团控股有限公司 Voice endpoint detection method and device
CN111583912A (en) * 2020-05-26 2020-08-25 阳光保险集团股份有限公司 Voice endpoint detection method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIN QIN; TU ZHENGZHENG; WANG QINGWEI; GUO YUTANG: "A voice endpoint detection method based on affinity propagation clustering", Journal of Anhui University (Natural Science Edition), no. 03 *
CHEN LICHUN: "Voice endpoint detection and enhancement methods in a real-time speech acquisition system", Audio Engineering, no. 05 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023092399A1 (en) * 2021-11-25 2023-06-01 华为技术有限公司 Speech recognition method, speech recognition apparatus, and system
WO2023115588A1 (en) * 2021-12-25 2023-06-29 华为技术有限公司 Speech interaction method and apparatus, and storage medium
CN114827756A (en) * 2022-04-28 2022-07-29 北京百度网讯科技有限公司 Audio data processing method, device, equipment and storage medium
CN117456984A (en) * 2023-10-26 2024-01-26 杭州捷途慧声科技有限公司 Voice interaction method and system based on voiceprint recognition

Also Published As

Publication number Publication date
CN113345473B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN113345473B (en) Voice endpoint detection method, device, electronic equipment and storage medium
US11749265B2 (en) Techniques for incremental computer-based natural language understanding
EP3577645B1 (en) End of query detection
US10699702B2 (en) System and method for personalization of acoustic models for automatic speech recognition
JP5381988B2 (en) Dialogue speech recognition system, dialogue speech recognition method, and dialogue speech recognition program
US9437186B1 (en) Enhanced endpoint detection for speech recognition
US9818407B1 (en) Distributed endpointing for speech recognition
CN111429899A (en) Speech response processing method, device, equipment and medium based on artificial intelligence
CN104766608A (en) Voice control method and voice control device
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
CN114385800A (en) Voice conversation method and device
CN116153294B (en) Speech recognition method, device, system, equipment and medium
CN111816216A (en) Voice activity detection method and device
EP2763136A1 (en) Method and system for obtaining relevant information from a voice communication
CN107886940B (en) Voice translation processing method and device
CN115346517A (en) Streaming voice recognition method, device, equipment and storage medium
CN112802498A (en) Voice detection method and device, computer equipment and storage medium
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
CN115512687B (en) Voice sentence-breaking method and device, storage medium and electronic equipment
CN112767955A (en) Audio encoding method and device, storage medium and electronic equipment
CN113948062B (en) Data conversion method and computer storage medium
CN115273862A (en) Voice processing method, device, electronic equipment and medium
CN114242064A (en) Speech recognition method and device, and training method and device of speech recognition model
US6601028B1 (en) Selective merging of segments separated in response to a break in an utterance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230423

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.

GR01 Patent grant