CN113345473B - Voice endpoint detection method, device, electronic equipment and storage medium - Google Patents

Voice endpoint detection method, device, electronic equipment and storage medium

Info

Publication number
CN113345473B
Authority
CN
China
Prior art keywords
voice
decoding
segment
data stream
time
Prior art date
Legal status
Active
Application number
CN202110703540.6A
Other languages
Chinese (zh)
Other versions
CN113345473A (en)
Inventor
王庆然
万根顺
高建清
刘聪
王智国
胡国平
Current Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC and iFlytek Co Ltd
Priority to CN202110703540.6A
Publication of CN113345473A
Application granted
Publication of CN113345473B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/26 Speech to text systems

Abstract

The invention provides a voice endpoint detection method, a voice endpoint detection apparatus, an electronic device and a storage medium. The method comprises the following steps: acquiring a real-time transcription text of a voice data stream and a voice segment of the voice data stream; performing silence detection on the voice segment based on the semantic features of the real-time transcription text and the voice features of the voice segment, to obtain a silence detection sequence of the voice segment; and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment. The method, apparatus, electronic device and storage medium provide semantic features as a reference for silence detection while taking the operation efficiency of voice endpoint detection into account, which facilitates real-time, low-power voice endpoint detection. Because silence detection combines voice features and semantic features, the anti-interference capability of voice endpoint detection is greatly improved, voice fragments with no specific semantics or irrelevant semantics are filtered out, and premature interruption of the human-computer interaction process caused by false triggering is avoided.

Description

Voice endpoint detection method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a method and apparatus for detecting a voice endpoint, an electronic device, and a storage medium.
Background
To implement voice-based human-computer interaction functionality, voice endpoints in a segment of voice are typically identified by voice endpoint detection (Voice Activity Detection, VAD) techniques, whereby a valid segment of voice is obtained to perform subsequent operations.
Compared with traditional VAD technology, VAD in a man-machine conversation scenario faces an additional difficulty: not only must noise unrelated to human voice be filtered out more accurately, but answer content whose semantic information is unclear or irrelevant to the current scenario must also be filtered out according to the semantic content of the user's answer, so that no response is made to such content.
Existing VAD technology can only detect human voice versus non-human voice and cannot analyze the semantic information contained in the voice. In complex scenarios it may misjudge environmental noise or human-made noise as normal speech, so the resulting valid speech segment contains a large amount of meaningless content and the human-computer interaction process is interrupted prematurely. In addition, introducing a large amount of meaningless content into subsequent voice processing increases system running latency and unnecessary power consumption, harming the interaction experience.
Disclosure of Invention
The invention provides a voice endpoint detection method, a voice endpoint detection apparatus, an electronic device and a storage medium, which are used to solve the problem in the prior art that voice endpoint detection can only distinguish human voice from non-human voice, leading to increased operation latency and power consumption and to the interaction being interrupted prematurely.
The invention provides a voice endpoint detection method, which comprises the following steps:
acquiring a real-time transcription text of a voice data stream and a voice segment of the voice data stream;
based on the semantic features of the real-time transcribed text and the voice features of the voice segments, performing silence detection on the voice segments to obtain a silence detection sequence of the voice segments, wherein the silence detection sequence indicates that a plurality of continuous segments in the voice segments are active voice or silence;
and detecting the voice endpoint of the voice data stream based on the silence detection sequence of the voice segment.
According to the voice endpoint detection method provided by the invention, the silence detection is carried out on the voice segment based on the semantic features of the real-time transcribed text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment, and the method comprises the following steps:
performing character decoding on the content characteristics of the voice segment, and determining a character decoding result as the silence detection sequence;
the content features are obtained by fusing semantic features of the real-time transcribed text and voice features of the voice segments.
According to the method for detecting the voice endpoint provided by the invention, the character decoding is carried out on the content characteristics of the voice segment, and the method comprises the following steps:
Performing attention conversion on the voice feature based on the semantic feature and the decoding state of the current decoding moment to obtain the voice context feature of the current decoding moment;
determining content characteristics of the current decoding moment based on the voice context characteristics of the current decoding moment;
character decoding is carried out based on the content characteristics of the current decoding moment, and a decoding result of the current decoding moment is obtained;
the decoding state of the current decoding moment is determined based on the decoding state of the last decoding moment and the decoding result of the last decoding moment, and the character decoding result is the decoding result of the final decoding moment.
According to the voice endpoint detection method provided by the invention, the voice context feature of the current decoding moment is obtained by performing attention conversion on the voice feature based on the semantic feature and the decoding state of the current decoding moment, and the voice endpoint detection method comprises the following steps:
determining the attention weight of each frame feature in the voice features based on the semantic features and the decoding state of the current decoding moment;
and weighting and fusing each frame characteristic based on the attention weight of each frame characteristic to obtain the voice context characteristic of the current decoding moment.
According to the method for detecting a voice endpoint provided by the invention, the voice endpoint detection is performed on the voice data stream based on the silence detection sequence of the voice segment, and the method comprises the following steps:
determining the time boundary of each segment in the voice segment based on the duration of the voice segment and the length of the silence detection sequence;
and detecting the voice endpoint of the voice data stream based on the silence detection sequence of each voice segment and the time boundary of each segment in the voice data stream.
According to the voice endpoint detection method provided by the invention, acquiring the real-time transcription text of the voice data stream comprises the following steps:
based on the audio energy of each voice frame in the voice data stream, carrying out mute segment filtration on the voice data stream;
and carrying out real-time transcription on the voice data stream filtered by the mute segment to obtain the real-time transcription text.
According to the voice endpoint detection method provided by the invention, the starting point of the voice data stream is the tail endpoint of the last effective voice segment.
The invention also provides a voice endpoint detection device, which comprises:
the data acquisition unit is used for acquiring real-time transcription text of the voice data stream and voice segments of the voice data stream;
The silence detection unit is used for carrying out silence detection on the voice section based on the semantic features of the real-time transcribed text and the voice features of the voice section to obtain a silence detection sequence of the voice section, wherein the silence detection sequence indicates that a plurality of continuous fragments in the voice section are active voice or silence;
and the end point detection unit is used for detecting the voice end point of the voice data stream based on the silence detection sequence of the voice segment.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the above-mentioned voice endpoint detection methods when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech endpoint detection method as described in any of the above.
According to the voice endpoint detection method, the voice endpoint detection apparatus, the electronic device and the storage medium, the real-time transcription text of the voice data stream is obtained through real-time speech recognition, providing semantic features as a reference for silence detection while taking the operation efficiency of voice endpoint detection into account, which facilitates real-time, low-power voice endpoint detection. Silence detection combines voice features and semantic features, which can greatly improve the anti-interference capability of voice endpoint detection, filter out voice fragments with no specific semantics or irrelevant semantics, and avoid premature interruption of the human-computer interaction process caused by false triggering. Moreover, the silence detection sequence represents the silence detection results of the segments of a voice segment as a whole, which, compared with silence detection at the level of individual speech frames, can further cope with noise interference and ensures the reliability of voice endpoint detection.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flowchart of the voice endpoint detection method provided by the present invention;
FIG. 2 is a schematic flowchart of step 120 in the voice endpoint detection method provided by the present invention;
FIG. 3 is a schematic flowchart of step 130 in the voice endpoint detection method provided by the present invention;
FIG. 4 is a schematic flowchart of the real-time speech recognition in step 110 of the voice endpoint detection method provided by the present invention;
FIG. 5 is an overall schematic flowchart of the voice endpoint detection method provided by the present invention;
FIG. 6 is a schematic structural diagram of the voice endpoint detection apparatus provided by the present invention;
FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, voice-based human-computer interaction generally first detects voice segments and then performs semantic understanding. The process can be divided into three steps: detecting the valid speech segments in which the user is speaking, extracting the semantic information of the user's speech from the valid speech segments, and organizing the answer content according to the semantic information.
The first step, detecting the valid speech segments in which the user is speaking, is currently realized by general VAD technology, which can screen out the parts of the voice data in which the user is actually speaking and remove environmental and other noise. Each time the user speaks a sentence, the conversation system extracts one valid speech segment. In addition, VAD technology implements the function of cutting the conversation: since the conversation system cannot otherwise judge when the user has finished speaking so that the answer voice can be played, the interaction logic currently set in conversation systems is that once the tail endpoint of a valid speech segment is detected, the user is deemed to have finished speaking, and the semantic understanding and subsequent answering processes begin.
However, general VAD technology can only detect human voice versus non-human voice and cannot analyze the semantic information contained in the voice, so its resistance to environmental noise is weak. When environmental noise occurs (such as the sound of tapping a table or electrical hum) or a nearby person speaks (side-channel speech), the VAD result may be abnormal, for two reasons. First, environmental noise without human voice, or human-made noise (such as laughter or coughing), is misjudged as normal speech content, so the interaction process is interrupted prematurely and a speech fragment without actual content is returned. Second, meaningless speech is intercepted and returned, such as filler words, long pauses and content irrelevant to the expected answer; such speech provides no useful semantic information to the conversation system, yet it still interrupts the interaction process prematurely, making it difficult for the conversation system to obtain what the user actually said. Because general VAD technology is so prone to such abnormalities, the probability of falsely triggering the interaction logic in the conversation system is very high, making the conversation system extremely unstable and the user experience poor.
To reduce the probability of false triggers, it may be considered to introduce semantic understanding techniques into the interaction logic described above. However, introducing semantic understanding into the interaction logic increases the latency of the conversation system: the user may have to wait noticeably after speaking before receiving a response. Constrained by the real-time requirements of conversation systems, how to improve VAD technology so that it better suits man-machine conversation scenarios, guaranteeing real-time performance while preventing the interaction from being interrupted prematurely by false triggers, therefore remains an urgent problem in the field of human-computer interaction.
FIG. 1 is a schematic flowchart of the voice endpoint detection method provided by the invention. As shown in FIG. 1, the voice endpoint detection method provided by the invention can be applied to common speech recognition scenarios such as conference transcription and intelligent customer service, and can also be applied to dialog scenarios that require real-time semantic understanding and have strict requirements on false triggering by noise. The method comprises the following steps:
step 110, obtaining real-time transcribed text of the voice data stream and a voice segment of the voice data stream.
Here, the voice data stream is a data stream obtained by recording in real time, and the real-time recording may be voice recording or video recording, which is not particularly limited in the embodiment of the present invention.
Speech recognition can be performed on the voice data stream in real time while it is being recorded, so as to obtain the real-time transcription text of the voice data stream. The real-time transcription text directly reflects the user's speech content in the voice data stream, and because the real-time speech recognition is performed while the voice data stream is being recorded, it occupies no additional processing time and is efficient and simple.
A voice segment is a piece of data intercepted from the voice data stream recorded in real time, so its duration is known. The durations of the individual voice segments intercepted from the real-time voice data stream during operation of the voice endpoint detection method may be the same or different. For example, the duration of a voice segment may be preset, and during real-time recording the voice data stream is intercepted once every preset duration, so that a newly recorded voice segment of the preset duration is obtained.
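A minimal sketch of this interception step, assuming the stream is delivered as an iterable of per-frame data and the preset segment duration is 40 frames (both assumptions made only for this example):

```python
from typing import Iterable, Iterator, List

SEGMENT_FRAMES = 40  # illustrative preset segment duration, in frames

def cut_speech_segments(frames: Iterable) -> Iterator[List]:
    """Yield one speech segment every SEGMENT_FRAMES newly recorded frames."""
    buffer: List = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == SEGMENT_FRAMES:
            yield buffer      # a newly recorded segment of the preset duration
            buffer = []       # start accumulating the next segment
    if buffer:                # a trailing, shorter segment at the end of the stream
        yield buffer

# Example: an 85-frame stream yields segments of 40, 40 and 5 frames.
print([len(seg) for seg in cut_speech_segments(range(85))])  # [40, 40, 5]
```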
Step 120, performing silence detection on the speech segment based on the semantic features of the real-time transcribed text and the speech features of the speech segment, to obtain a silence detection sequence of the speech segment, where the silence detection sequence indicates that a plurality of continuous segments in the speech segment are active speech or silence.
Specifically, the real-time transcription text is derived from the real-time recorded voice data stream, and the voice segments are also derived from the real-time recorded voice data stream, so that the user speaking content contained in the real-time transcription text necessarily covers the user speaking content in a segment of the voice segments in the voice data stream.
Unlike general VAD technology, which performs silence detection on a voice segment only from the perspective of voice features, the embodiment of the invention considers not only the voice features of the voice segment but also the semantic features of the real-time transcription text that covers the user's speech content in the voice segment. This silence detection mode combining voice features and semantic features means that, when judging whether each moment in the voice segment is silence or active speech, the judgment relies not only on the acoustic information reflected by the voice features, such as intensity, loudness and pitch, but also refers to the semantic information reflected by the semantic features, such as whether semantic content exists and whether that content is related to the conversation topic. Silence detection realized in this way can resist noise interference and can further filter out speech segments that are irrelevant to the required semantics or have no specific semantics.
The resulting silence detection results are represented in the form of a silence detection sequence that naturally divides the speech segment into several consecutive segments and sequentially identifies each segment as either active speech or silence. It should be noted that, for the case of dividing a speech segment into several segments, in the embodiment of the present invention, the duration of each segment obtained by dividing in a single speech segment is equal by default.
Furthermore, silence detection combining the semantic features and the voice features can be realized by a pre-trained neural network model. For example, the semantic features and the voice features may be input into a pre-trained neural network model for silence detection, or the semantic features and the voice features may be fused first and the fused features input into the model. The neural network model for silence detection may have an encoder plus decoder structure, in which the semantic features and the voice features are encoded and fused by the encoder and the fused features are decoded by the decoder to output the silence detection sequence; it may also consist of a decoder only, in which case the features are fused and decoded during decoding.
Step 130, performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
Specifically, since the duration of the voice segment itself is known, after the silence detection sequence of the voice segment is obtained, the duration of each segment in the voice segment can be derived, so that the duration of active speech or silence in the voice segment is determined. Using the voice segment in this way compensates for the fact that the silence detection output sequence alone cannot represent accurate time boundaries, so the silence detection sequence can be aligned with the time axis.
On the basis, the voice end point detection of the voice data stream can be realized by combining the duration time of active voice or silence of continuous voice segments in the voice data stream, so that the head end point and the tail end point of effective voice segments possibly contained in the voice data stream are determined, and the effective voice segments are conveniently output for subsequent conversation.
The method provided by the embodiment of the invention obtains the real-time transcription text of the voice data stream through real-time speech recognition, providing semantic features as a reference for silence detection while taking the operation efficiency of voice endpoint detection into account, which facilitates real-time, low-power voice endpoint detection. Silence detection combines voice features and semantic features, which can greatly improve the anti-interference capability of voice endpoint detection, filter out voice fragments with no specific semantics or irrelevant semantics, and avoid premature interruption of the human-computer interaction process caused by false triggering. Moreover, the silence detection sequence represents the silence detection results of the segments of the voice segment as a whole, which, compared with silence detection at the level of individual speech frames, can further cope with noise interference and ensures the reliability of voice endpoint detection.
Based on the above embodiment, step 120 includes:
character decoding is carried out on the content characteristics of the voice segment, and the character decoding result of the voice segment is determined to be a silence detection sequence; the content features are obtained by fusing semantic features of the real-time transcribed text and voice features of the voice segments.
Specifically, since the voice segment itself has a time sequence, the silence detection process is also a process of serialized output. In the embodiment of the invention, silence detection of a voice segment can be realized by character decoding of the content features that fuse the semantic features of the real-time transcription text with the voice features of the voice segment. The character decoding may be implemented with reference to the decoder of a general text generation task; text translation and summarization, for example, both perform character decoding based on encoded features in order to generate a target text. The character decoding of the content features may thus be implemented by the decoder of an encoder plus decoder structure.
The content features used for character decoding may be obtained before character decoding, by encoding and fusing the semantic features of the real-time transcription text with the voice features of the voice segment; for example, the semantic features and the voice features may be directly added as the content features, or spliced as the content features. Alternatively, the semantic features may be fused through the attention mechanism during the decoding process, so that at each character decoding step the voice features required by the current decoding are fused with the semantic features of the real-time transcription text, and the content features obtained by this fusion are decoded.
In the character decoding result obtained by character decoding, each character corresponds to one segment of the voice segment, and each character indicates that the corresponding segment is active speech or silence. For example, the character decoding result of a voice segment may be "speech|silence|speech", meaning that the voice segment is uniformly divided into three segments, where the first segment is an active speech segment, the second segment is a silence segment, and the third segment is an active speech segment.
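A small sketch of how such a character decoding result could be mapped back to per-segment labels, assuming a "|"-separated string with the labels "speech" and "silence" (an illustrative representation, not mandated by the text):

```python
def parse_silence_sequence(decoded: str) -> list:
    """Turn a character decoding result such as "speech|silence|speech" into a
    per-segment silence detection sequence (True = active speech, False = silence)."""
    labels = [token.strip() for token in decoded.split("|") if token.strip()]
    return [label == "speech" for label in labels]

# The example above: three uniform segments, the middle one silent.
print(parse_silence_sequence("speech|silence|speech"))  # [True, False, True]
```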
Based on any of the above embodiments, fig. 2 is a flowchart of step 120 in the voice endpoint detection method provided by the present invention, and as shown in fig. 2, step 120 includes:
step 121, performing attention conversion on the voice feature based on the semantic feature and the decoding state of the current decoding moment to obtain the voice context feature of the current decoding moment;
step 122, determining the content characteristics of the current decoding moment based on the voice context characteristics of the current decoding moment;
step 123, performing character decoding based on the content characteristics of the current decoding moment to obtain a decoding result of the current decoding moment;
the decoding state of the current decoding moment is determined based on the decoding state of the last decoding moment and the decoding result of the last decoding moment, and the character decoding result is the decoding result of the final decoding moment.
Specifically, considering that a speech segment is a segment of speech in a speech data stream, and a real-time transcription text covers the overall semantics of the speech data stream, the semantic features of the real-time transcription text reflect not only the semantic information contained in the speech segment, but also the semantic information contained in the speech data preceding the speech segment in the speech data stream. If only the semantic features of the real-time transcribed text and the speech features of the speech segment are added or spliced, the semantic information contained in the speech segment cannot be distinguished from the semantic information contained in the speech data preceding the speech segment, and the content features thus obtained are not reasonable. Therefore, in the embodiment of the invention, semantic features are fused into the attention mechanism in the decoding process for addition in the character decoding process, so that the voice features required by the current decoding are fused with the semantic features of the real-time transcribed text in the decoding process of each character, and the characterization capability of the fused content features on two layers of voice and semantic is improved.
Further, in the process of character decoding, attention conversion can be performed on the voice features of the voice segment according to the decoding state of the current decoding time and in combination with the semantic features of the real-time transcription text, so as to obtain the voice features adjusted by attention according to the semantic information and the decoding state of the current decoding time, namely the voice context features. The decoding state at any decoding time contains the history information generated in the decoding process before that time.
Here, the strength of each feature in the voice context features is adjusted based on the semantic information and the historical state of the decoding process: in the adjusted voice context features, the voice features corresponding to semantic information related to the conversation are enhanced, while the voice features corresponding to semantic information unrelated to the conversation, or with no specific semantics, are weakened. As a result, when the voice context features are subsequently used for character decoding, the active-speech parts and the silent parts can be distinguished more easily, which improves the accuracy and reliability of character decoding.
After the voice context feature of the current decoding time is obtained, it may be decoded directly as the content feature of the current decoding time, or it may first be fused with the speech feature to be decoded at the current decoding time (determined from the previous decoding state and decoding result), with the fused feature decoded as the content feature.
After the content feature of the current decoding time is obtained, character decoding can be performed based on it; for example, decoding may combine the content feature of the current decoding time with the decoding state and decoding result of the previous decoding time, which improves the reliability of character decoding. Once the character decoding at the current decoding time is completed, the character output at the current decoding time is obtained, and this character is spliced after the decoding result of the previous decoding time to obtain the decoding result of the current decoding time. For example, if the decoding result at the previous decoding time is "silence" and the character output at the current decoding time is "silence", the new "silence" is spliced after the previous "silence" to obtain the decoding result "silence|silence" at the current decoding time.
According to the method provided by the embodiment of the invention, semantic information is fused in the decoding process, so that the accuracy of silence detection is improved, and the accuracy of voice endpoint detection is improved.
Based on any of the above embodiments, step 121 includes:
determining the attention weight of each frame feature in the voice features based on the semantic features and the decoding state of the current decoding moment;
and weighting and fusing each frame characteristic based on the attention weight of each frame characteristic to obtain the voice context characteristic at the current decoding moment.
Specifically, for the current decoding moment, attention interaction can be performed on the semantic feature of the real-time transcribed text and the decoding state of the current decoding moment through an attention mechanism, so as to obtain the attention weight of the frame feature of each frame of voice in the voice feature, which can be expressed as the following form:
α′_t = softmax(v^T · tanh(q_t + K))
where t denotes the current decoding time, α′_t is the attention weight of each frame feature in the speech features at the current decoding time, q_t is the decoding state at the current decoding time, K is the semantic feature, and v^T is a preset weight matrix.
After the attention weight of each frame feature in the voice features is obtained, weighting and fusing can be performed on each frame feature through the attention weight, so as to adjust the intensity of each frame feature in the voice features, thereby obtaining the voice context feature at the current decoding time, which can be expressed as the following form:
c′_t = Σ α′_t · h_t
where c′_t is the speech context feature at the current decoding time and h_t denotes each frame feature in the speech features at the current decoding time.
Based on any of the above embodiments, step 122 includes:
and determining the content characteristics of the current decoding moment by combining the voice context characteristics of the current decoding moment and the voice decoding characteristics of the current decoding moment.
Here, the speech decoding feature at the current decoding time can be understood as the speech feature that would be decoded at the current decoding time if the semantic features were ignored during character decoding. The speech decoding feature at the current decoding time may be obtained by adjusting the speech features of the voice segment based on the decoding state and decoding result of the previous decoding time.
For example, the speech context feature of the current decoding time and the speech decoding feature of the current decoding time may be spliced and then used as the content feature of the current decoding time; the two may be added to serve as the content feature of the current decoding time; or the two may be spliced and further feature extraction performed to obtain the content feature of the current decoding time.
The manner of adding the content features can be expressed as follows:
C_new = c_t + c′_t
where C_new is the content feature at the current decoding time and c_t is the speech decoding feature at the current decoding time.
Based on any of the above embodiments, fig. 3 is a flowchart illustrating a step 130 in the voice endpoint detection method according to the present invention, and as shown in fig. 3, the step 130 includes:
step 131, determining the time boundary of each segment in the speech segment based on the duration of the speech segment and the length of the silence detection sequence.
Specifically, considering that the silence detection sequence obtained in step 120 only represents that each segment in the speech segment is active speech or silence, and cannot represent the corresponding position of each segment in the speech segment on the time axis, it is necessary to solve the time boundary of each segment in the speech segment.
For the case of dividing a speech segment into several segments, in the embodiment of the present invention, the duration of each segment divided in a single speech segment is equal by default. Because the duration of the voice segment is known, after the silence detection sequence of the voice segment is obtained, the duration of each segment can be determined based on the duration of the voice segment and the length of the silence detection sequence, and then the time boundary of each segment can be determined according to the position of each segment in the voice segment.
For example, assume that each speech segment is 40 frames long. If the first speech segment is decoded into 4 characters, e.g. "speech|silence|speech|speech", the duration of each segment in the first speech segment is 40/4 = 10 frames, and the time boundaries of the 4 segments are respectively 0-10 frames, 10-20 frames, 20-30 frames and 30-40 frames. If the second speech segment is decoded into only 2 characters, e.g. "speech|speech", the duration of each segment in the second speech segment is 40/2 = 20 frames, and the time boundaries of the two segments are respectively 40-60 frames and 60-80 frames.
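A sketch of this time-boundary computation, assuming frame-based durations and the default equal division of each speech segment stated above:

```python
def segment_boundaries(segment_frames: int, silence_sequence: list,
                       offset_frames: int = 0) -> list:
    """(start_frame, end_frame, is_active) for each segment of one speech segment,
    given its known duration and its silence detection sequence."""
    per_segment = segment_frames // len(silence_sequence)  # equal durations by default
    return [(offset_frames + i * per_segment,
             offset_frames + (i + 1) * per_segment,
             is_active)
            for i, is_active in enumerate(silence_sequence)]

# First 40-frame speech segment, 4 characters -> 10 frames per segment.
print(segment_boundaries(40, [True, False, True, True]))
# [(0, 10, True), (10, 20, False), (20, 30, True), (30, 40, True)]
# Second 40-frame speech segment, 2 characters -> 20 frames per segment, offset 40.
print(segment_boundaries(40, [True, True], offset_frames=40))
# [(40, 60, True), (60, 80, True)]
```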
Step 132, performing voice endpoint detection on the voice data stream based on the silence detection sequence of each voice segment and the time boundary of each segment in the voice data stream.
Specifically, after the time boundaries of the segments in each speech segment are obtained, the speech endpoint detection can be performed based on the active speech or silence corresponding to the segment represented by the silence detection sequence of each speech segment and the time boundaries of the segments in each speech segment. The voice endpoint detection can be implemented based on preset detection rules of the head endpoint and the tail endpoint, and specific detection rules can be adjusted according to specific occasions to which the voice endpoint detection is applied, which is not particularly limited in the embodiment of the invention.
Based on any of the above embodiments, the voice endpoint detection of the voice data stream in step 132 may be divided into two parts: head endpoint detection and tail endpoint detection.
Before detection, adjacent segments of the same type may be merged, after which the time boundary and type of each segment in the voice data stream are obtained. The type of a segment here is either active speech or silence.
For head endpoint detection, when active speech lasting a continuous start duration is detected, it is determined that active speech has begun, thereby locating the head endpoint. The start duration may be preset, for example to 20 frames or 15 frames. Taking a start duration of 20 frames as an example, assume frames 0-10 are an active speech segment and frames 10-20 are a silence segment; since the 0-10 active run is shorter than 20 frames, frames 0-20 are treated as non-speech. From frame 20 onward, 60 consecutive frames are all active speech; since 60 exceeds 20 frames, the head endpoint of the valid speech segment is determined to be at frame 20 (confirmed by the end of the first 40 frames), and frames 20-80 form the valid speech segment.
In addition, some auxiliary means may be set for detecting the valid speech segment. For example, a 30-frame silence protection policy may be used: after active speech has been detected (e.g., the 20 frames above), if the duration of the adjoining silence segment does not exceed 30 frames, the valid speech segment is considered to be still continuing. This ensures that no false trigger occurs when the user pauses briefly after one or two words before speaking the next word.
For tail endpoint detection, if silence lasting a continuous termination duration is detected after a head endpoint has already been found, it is determined that the active speech has ended, thereby locating the tail endpoint. The termination duration may be preset, for example to 30 frames or 40 frames. Taking a termination duration of 30 frames as an example, assume there are four consecutive 10-frame silence segments starting from frame 80, 40 frames in total; the tail endpoint can then be determined during the third silence segment, and frames 80 to 120 are determined to be silence.
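A sketch of these head/tail endpoint rules, assuming a merged, time-ordered list of (start_frame, end_frame, is_active) segments and the illustrative 20-frame start duration and 30-frame termination duration; the silence-protection behavior is folded into the termination check in this simplified version:

```python
def detect_endpoints(segments, start_frames=20, end_frames=30):
    """Locate the head and tail endpoints of one valid speech segment.

    segments: time-ordered list of (start_frame, end_frame, is_active) with
    adjacent segments of the same type already merged.
    Returns (head_frame, tail_frame); tail_frame is None if speech has not
    ended yet, and both are None if no head endpoint was found.
    """
    head, silence_run = None, 0
    for start, end, is_active in segments:
        if head is None:
            # Head endpoint: continuous active speech of at least start_frames.
            if is_active and (end - start) >= start_frames:
                head, silence_run = start, 0
        elif is_active:
            # Silence shorter than end_frames acts as a protection window:
            # the valid segment is considered to continue.
            silence_run = 0
        else:
            silence_run += end - start
            if silence_run >= end_frames:
                # Tail endpoint: accumulated silence of at least end_frames.
                return head, end - silence_run
    return head, None

# Frames 0-10 speech (too short), 10-20 silence, 20-80 speech, 80-120 silence.
print(detect_endpoints([(0, 10, True), (10, 20, False),
                        (20, 80, True), (80, 120, False)]))  # (20, 80)
```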
Based on any of the above embodiments, fig. 4 is a schematic flow chart of real-time speech recognition in step 110 in the speech endpoint detection method provided by the present invention, and as shown in fig. 4, in step 110, obtaining a real-time transcription text of a speech data stream includes:
step 111, based on the audio energy of each voice frame in the voice data stream, performing silence segment filtering on the voice data stream;
and 112, performing real-time transcription on the voice data stream filtered by the mute segment to obtain a real-time transcription text.
Specifically, considering that speech recognition of a voice data stream recorded in real time requires considerable computing resources, a filtering step can be added before speech recognition. The audio energy of each speech frame in the real-time voice data stream is obtained, and the magnitude of this energy is used to judge whether the frame may belong to a silence segment; frames that likely belong to silence segments are then filtered out of the voice data stream, and only the remaining frames are transcribed in real time, which reduces the amount of data to be transcribed and thus the computing resources required for real-time transcription. For example, an energy threshold and a preset frame count may be set: if the number of consecutive speech frames whose audio energy stays below the energy threshold exceeds the preset frame count, the segment containing those frames is judged to be a silence segment and is filtered out.
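A minimal sketch of this filtering rule, assuming per-frame sample arrays and illustrative values for the energy threshold and the preset frame count:

```python
import numpy as np

def filter_silence_frames(frames, energy_threshold=0.01, min_silence_frames=20):
    """Drop frames belonging to runs of low-energy frames longer than
    min_silence_frames; frames is a list of per-frame sample arrays."""
    energies = [float(np.mean(np.square(f))) for f in frames]   # per-frame audio energy
    keep = [True] * len(frames)
    run_start = None
    for i, e in enumerate(energies + [float("inf")]):           # sentinel flushes the last run
        if e < energy_threshold:
            if run_start is None:
                run_start = i                                    # a low-energy run begins
        else:
            if run_start is not None and i - run_start > min_silence_frames:
                for j in range(run_start, i):                    # mark the silent run for removal
                    keep[j] = False
            run_start = None
    return [f for f, k in zip(frames, keep) if k]
```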
After that, real-time transcription is performed on the voice data stream: the acoustic features of each speech frame are extracted first, and decoding is then performed with a real-time acoustic model to obtain the decoded text as the real-time transcription text. In the transcription process, given the requirement of semantic consistency of the text, decoding audio from an overly short sliding window may seriously lose semantic information. Preferably, therefore, when acoustic features are extracted from the voice data stream, they are extracted cumulatively in a non-sliding-window manner; for example, Filter Bank or MFCC (Mel-scale Frequency Cepstral Coefficients) features can be used to determine the acoustic features of each speech frame.
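A sketch of cumulative (non-sliding-window) acoustic feature extraction, assuming librosa is available as the MFCC/filter-bank front end; the sample rate and coefficient count are illustrative assumptions:

```python
import numpy as np
import librosa  # assumed available; any MFCC/filter-bank front end would do

class CumulativeFeatureExtractor:
    """Extract acoustic features over all audio accumulated so far, rather than
    over a short sliding window, to preserve semantic continuity."""

    def __init__(self, sample_rate: int = 16000, n_mfcc: int = 13):
        self.sample_rate = sample_rate
        self.n_mfcc = n_mfcc
        self.audio = np.zeros(0, dtype=np.float32)

    def push(self, new_samples: np.ndarray) -> np.ndarray:
        # Accumulate the newly recorded samples and recompute features over
        # the whole stream so far; a log-Mel filter bank
        # (librosa.feature.melspectrogram) could be used instead of MFCC.
        self.audio = np.concatenate([self.audio, new_samples.astype(np.float32)])
        return librosa.feature.mfcc(y=self.audio, sr=self.sample_rate,
                                    n_mfcc=self.n_mfcc)
```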
According to the method provided by the embodiment of the invention, filtering silence segments by audio energy saves a large amount of speech decoding computation.
Based on any of the above embodiments, step 111 may be implemented in the following form:
Two energy thresholds are preset: a lower energy threshold P_low and a higher energy threshold P_high.
When currently in a silence segment, or when silence detection has just started:
if the audio energy P of the current speech frame satisfies P < P_low, the process stays in (or jumps to) the silence segment;
if P_low ≤ P < P_high, the process jumps to a transition segment.
When currently in a transition segment:
if the audio energy P of the current speech frame falls back below P_low, the process jumps to a silence segment;
if P ≥ P_high, the process jumps to a speech segment, and the speech segment starts.
When currently in a speech segment:
if the audio energy P of the current speech frame falls below P_low and remains there for more than M consecutive frames, the process jumps to a silence segment and the speech segment ends;
if P falls below P_low but for no more than M frames, the speech segment is maintained and monitoring continues.
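A sketch of this three-state jump logic, assuming per-frame energies are precomputed; the direct silence-to-speech jump when P ≥ P_high is an assumption not spelled out above:

```python
def two_threshold_vad(energies, p_low, p_high, m_frames):
    """Label each frame 'silence' / 'transition' / 'speech' following the
    two-threshold jump rules; energies is a sequence of per-frame audio energies."""
    state, low_run, labels = "silence", 0, []
    for p in energies:
        if state in ("silence", "transition"):
            if p < p_low:
                state = "silence"
            elif p < p_high:
                state = "transition"
            else:
                state, low_run = "speech", 0     # P >= P_high: the speech segment starts
        else:  # currently in a speech segment
            if p < p_low:
                low_run += 1
                if low_run > m_frames:           # sustained low energy: end the speech segment
                    state, low_run = "silence", 0
            else:
                low_run = 0                      # energy recovered, keep the speech segment
        labels.append(state)
    return labels

# Toy run with p_low = 0.2, p_high = 0.5, M = 2 frames.
print(two_threshold_vad([0.1, 0.3, 0.7, 0.6, 0.1, 0.1, 0.1], 0.2, 0.5, 2))
# ['silence', 'transition', 'speech', 'speech', 'speech', 'speech', 'silence']
```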
Based on any of the above embodiments, the starting point of the voice data stream is the tail end point of the last valid voice segment.
In particular, the real-time recording of the voice data stream is not one uninterrupted continuum. Rather, considering that the speech content of different valid speech segments is relatively independent, once a tail endpoint is detected during voice endpoint detection, one valid speech segment can be considered finished, and the content of the voice data stream recorded afterwards is unrelated to the content of the valid speech segment recorded before it. Therefore, the tail endpoint of the last valid speech segment is used as the starting point of the voice data stream whose recording then restarts.
The resulting speech data stream does not contain the content of the previously recorded valid speech segments, and therefore the semantic features referred to in silence detection of speech segments are not related to the content of the previously recorded valid speech segments. By using the tail end point of the last valid voice segment as the starting point of the voice data stream recorded by restarting, the content irrelevant to the current voice data stream is filtered, thereby being beneficial to improving the reliability of silence detection.
Based on any of the above embodiments, fig. 5 is a flow chart of the voice endpoint detection method provided by the present invention, as shown in fig. 5, for a voice data stream recorded in real time, the voice data stream may be divided into two branches for processing respectively.
One branch feeds the audio after the last valid speech segment into decoding; in other words, the tail endpoint of the last valid speech segment is taken as the starting point of the voice data stream, and the voice data stream recorded in real time is fed into decoding to realize real-time speech transcription. In this process, the acoustic features of each speech frame in the voice data stream are first extracted, then silence segments are filtered out based on the audio energy of each frame, and the acoustic features of the remaining frames are input into the real-time acoustic model for decoding, obtaining the real-time transcription text of the voice data stream.
On this basis, semantic extraction can be performed on the real-time transcription text to obtain its semantic features, i.e. the real-time transcription text is converted into a high-dimensional vector representation. A specific way of extracting them is to map each word in the real-time transcription text to an embedding vector, extract the hidden-layer vector of each word through structures such as a long short-term memory network or a recurrent neural network, and splice the hidden-layer vectors to obtain the semantic features.
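A sketch of this semantic extraction, assuming PyTorch and an illustrative vocabulary size and dimensions; the concatenation of per-token hidden vectors follows the description above:

```python
import torch
import torch.nn as nn

class SemanticExtractor(nn.Module):
    """Map the real-time transcription text (as token ids) to a semantic feature:
    embed each token, run an LSTM, and splice the per-token hidden vectors."""

    def __init__(self, vocab_size: int = 8000, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embedding(token_ids)            # (batch, tokens, embed_dim)
        hidden, _ = self.lstm(emb)                 # per-token hidden-layer vectors
        return hidden.reshape(hidden.size(0), -1)  # spliced into one semantic feature

# One transcribed utterance of 6 tokens.
feats = SemanticExtractor()(torch.randint(0, 8000, (1, 6)))
print(feats.shape)  # torch.Size([1, 1536])
```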
In the other branch, a speech segment is accumulated and then sent to decoding; for example, as shown in FIG. 5, a speech segment may be intercepted every 40 accumulated frames and sent to decoding, where silence detection is performed in combination with the semantic features. In this process, the speech features of the speech segment are first extracted and can then be further encoded; for example, an encoding part produces a high-dimensional feature encoding vector. The encoding part may be a long short-term memory network, a recurrent neural network, or the like. After that, in the decoding part, silence detection is performed on the speech segment by combining the semantic features with the encoding vector obtained by further encoding the speech features, and the silence detection sequence of the speech segment is output.
After the silence detection sequence is obtained, the time boundaries of the segments in the speech segment can be determined by combining the duration of the speech segment with the length of the silence detection sequence, so that voice endpoint detection can be performed. For example, in FIG. 5 the concatenated silence detection sequence of three speech segments is "speech|silence|speech|speech|speech|speech|silence|silence|silence|silence". The first speech segment is decoded into the 4 characters "speech|silence|speech|speech", so each segment of the first speech segment lasts 40/4 = 10 frames, and the time boundaries of the 4 segments are 0-10, 10-20, 20-30 and 30-40 frames. The second speech segment is decoded into the 2 characters "speech|speech", so each segment of the second speech segment lasts 40/2 = 20 frames, with time boundaries 40-60 and 60-80 frames. The third speech segment is decoded into the 4 characters "silence|silence|silence|silence", so each segment of the third speech segment lasts 40/4 = 10 frames, with time boundaries 80-90, 90-100, 100-110 and 110-120 frames. In FIG. 5, the diagonally filled squares represent "speech" and the blank squares represent "silence". On this basis, the voice endpoints in the voice data stream can be detected in combination with the preset endpoint detection rules.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a voice endpoint detection apparatus according to the present invention, as shown in fig. 6, the apparatus includes:
a data acquisition unit 610, configured to acquire a real-time transcription text of a voice data stream and a voice segment of the voice data stream;
a silence detection unit 620, configured to perform silence detection on the speech segment based on the semantic feature of the real-time transcribed text and the speech feature of the speech segment, to obtain a silence detection sequence of the speech segment, where the silence detection sequence indicates that a plurality of continuous segments in the speech segment are active speech or silence;
the endpoint detection unit 630 is configured to perform voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
The apparatus provided by the embodiment of the invention obtains the real-time transcription text of the voice data stream through real-time speech recognition, providing semantic features as a reference for silence detection while taking the operation efficiency of voice endpoint detection into account, which facilitates real-time, low-power voice endpoint detection. Silence detection combines voice features and semantic features, which can greatly improve the anti-interference capability of voice endpoint detection, filter out voice fragments with no specific semantics or irrelevant semantics, and avoid premature interruption of the human-computer interaction process caused by false triggering. Moreover, the silence detection sequence represents the silence detection results of the segments of the voice segment as a whole, which, compared with silence detection at the level of individual speech frames, can further cope with noise interference and ensures the reliability of voice endpoint detection.
Based on any of the above embodiments, the silence detection unit 620 is configured to:
performing character decoding on the content characteristics of the voice segment, and determining a character decoding result as the silence detection sequence;
the content features are obtained by fusing semantic features of the real-time transcribed text and voice features of the voice segments.
Based on any of the above embodiments, the silence detection unit 620 is configured to:
performing attention conversion on the voice feature based on the semantic feature and the decoding state of the current decoding moment to obtain the voice context feature of the current decoding moment;
determining content characteristics of the current decoding moment based on the voice context characteristics of the current decoding moment;
character decoding is carried out based on the content characteristics of the current decoding moment, and a decoding result of the current decoding moment is obtained;
the decoding state of the current decoding moment is determined based on the decoding state of the last decoding moment and the decoding result of the last decoding moment, and the character decoding result is the decoding result of the final decoding moment.
Based on any of the above embodiments, the silence detection unit 620 is configured to:
determining the attention weight of each frame feature in the voice features based on the semantic features and the decoding state of the current decoding moment;
And weighting and fusing each frame characteristic based on the attention weight of each frame characteristic to obtain the voice context characteristic of the current decoding moment.
Based on any of the above embodiments, the endpoint detection unit 630 is configured to:
determining the time boundary of each segment in the voice segment based on the duration of the voice segment and the length of the silence detection sequence;
and detecting the voice endpoint of the voice data stream based on the silence detection sequence of each voice segment and the time boundary of each segment in the voice data stream.
Based on any of the above embodiments, the data acquisition unit 610 is configured to:
based on the audio energy of each voice frame in the voice data stream, carrying out mute segment filtration on the voice data stream;
and carrying out real-time transcription on the voice data stream filtered by the mute segment to obtain the real-time transcription text.
Based on any of the above embodiments, the starting point of the voice data stream is the tail end point of the last valid voice segment.
Fig. 7 illustrates a physical schematic diagram of an electronic device, as shown in fig. 7, which may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a voice endpoint detection method comprising: acquiring a real-time transcription text of a voice data stream and a voice segment of the voice data stream; based on the semantic features of the real-time transcribed text and the voice features of the voice segments, performing silence detection on the voice segments to obtain a silence detection sequence of the voice segments, wherein the silence detection sequence indicates that a plurality of continuous segments in the voice segments are active voice or silence; and detecting the voice endpoint of the voice data stream based on the silence detection sequence of the voice segment.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and, when sold or used as a standalone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the voice endpoint detection method provided above, the method comprising: acquiring a real-time transcribed text of a voice data stream and a voice segment of the voice data stream; performing silence detection on the voice segment based on the semantic features of the real-time transcribed text and the voice features of the voice segment, to obtain a silence detection sequence of the voice segment, wherein the silence detection sequence indicates whether each of a plurality of continuous segments in the voice segment is active voice or silence; and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the voice endpoint detection method provided above, the method comprising: acquiring a real-time transcribed text of a voice data stream and a voice segment of the voice data stream; performing silence detection on the voice segment based on the semantic features of the real-time transcribed text and the voice features of the voice segment, to obtain a silence detection sequence of the voice segment, wherein the silence detection sequence indicates whether each of a plurality of continuous segments in the voice segment is active voice or silence; and performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part that contributes to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for detecting a voice endpoint, comprising:
acquiring a real-time transcribed text of a voice data stream and a voice segment of the voice data stream;
based on the semantic features of the real-time transcribed text and the voice features of the voice segment, performing silence detection on the voice segment to obtain a silence detection sequence of the voice segment, wherein the silence detection sequence indicates whether each of a plurality of continuous segments in the voice segment is active voice or silence;
performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment;
the silence detection of the voice segment is performed based on the semantic features of the real-time transcribed text and the voice features of the voice segment to obtain a silence detection sequence of the voice segment, which comprises the following steps:
performing character decoding on the content features of the voice segment, and determining the character decoding result as the silence detection sequence;
wherein the content features are obtained by fusing the semantic features of the real-time transcribed text with the voice features of the voice segment;
the character decoding of the content features of the voice segment comprises:
performing attention conversion on the voice features based on the semantic features and the decoding state of the current decoding moment, to obtain the voice context feature of the current decoding moment;
determining the content features of the current decoding moment based on the voice context feature of the current decoding moment;
performing character decoding based on the content features of the current decoding moment, to obtain the decoding result of the current decoding moment;
wherein the decoding state of the current decoding moment is determined based on the decoding state and the decoding result of the previous decoding moment, and the character decoding result is the decoding result of the final decoding moment.
2. The method for detecting a voice endpoint according to claim 1, wherein the performing attention conversion on the voice features based on the semantic features and the decoding state of the current decoding moment to obtain the voice context feature of the current decoding moment comprises:
determining the attention weight of each frame feature in the voice features based on the semantic features and the decoding state of the current decoding moment;
and weighting and fusing the frame features based on the attention weight of each frame feature, to obtain the voice context feature of the current decoding moment.
3. The voice endpoint detection method of claim 1, wherein the performing voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segments comprises:
determining the time boundary of each continuous segment within the voice segment based on the duration of the voice segment and the length of the silence detection sequence;
and performing voice endpoint detection on the voice data stream based on the silence detection sequence of each voice segment in the voice data stream and the time boundary of each continuous segment.
4. The method of claim 1, wherein the acquiring the real-time transcribed text of the voice data stream comprises:
performing silence segment filtering on the voice data stream based on the audio energy of each voice frame in the voice data stream;
and performing real-time transcription on the voice data stream after silence segment filtering, to obtain the real-time transcribed text.
5. The method according to any one of claims 1 to 4, wherein the starting point of the voice data stream is the tail endpoint of the previous valid voice segment.
6. A voice endpoint detection apparatus, comprising:
a data acquisition unit, configured to acquire a real-time transcribed text of a voice data stream and a voice segment of the voice data stream;
a silence detection unit, configured to perform silence detection on the voice segment based on the semantic features of the real-time transcribed text and the voice features of the voice segment, to obtain a silence detection sequence of the voice segment, wherein the silence detection sequence indicates whether each of a plurality of continuous segments in the voice segment is active voice or silence;
an endpoint detection unit, configured to perform voice endpoint detection on the voice data stream based on the silence detection sequence of the voice segment;
the silence detection unit is specifically configured to:
performing character decoding on the content features of the voice segment, and determining the character decoding result as the silence detection sequence;
wherein the content features are obtained by fusing the semantic features of the real-time transcribed text with the voice features of the voice segment;
the character decoding of the content features of the voice segment comprises:
performing attention conversion on the voice features based on the semantic features and the decoding state of the current decoding moment, to obtain the voice context feature of the current decoding moment;
determining the content features of the current decoding moment based on the voice context feature of the current decoding moment;
performing character decoding based on the content features of the current decoding moment, to obtain the decoding result of the current decoding moment;
wherein the decoding state of the current decoding moment is determined based on the decoding state and the decoding result of the previous decoding moment, and the character decoding result is the decoding result of the final decoding moment.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the voice endpoint detection method of any one of claims 1 to 5.
8. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the voice endpoint detection method according to any one of claims 1 to 5.
CN202110703540.6A 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium Active CN113345473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703540.6A CN113345473B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110703540.6A CN113345473B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113345473A CN113345473A (en) 2021-09-03
CN113345473B true CN113345473B (en) 2024-02-13

Family

ID=77478230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703540.6A Active CN113345473B (en) 2021-06-24 2021-06-24 Voice endpoint detection method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113345473B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023092399A1 (en) * 2021-11-25 2023-06-01 华为技术有限公司 Speech recognition method, speech recognition apparatus, and system
WO2023115588A1 (en) * 2021-12-25 2023-06-29 华为技术有限公司 Speech interaction method and apparatus, and storage medium
CN114827756B (en) * 2022-04-28 2023-03-21 北京百度网讯科技有限公司 Audio data processing method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7024353B2 (en) * 2002-08-09 2006-04-04 Motorola, Inc. Distributed speech recognition with back-end voice activity detection apparatus and method
US8650029B2 (en) * 2011-02-25 2014-02-11 Microsoft Corporation Leveraging speech recognizer feedback for voice activity detection
CN103903633B (en) * 2012-12-27 2017-04-12 华为技术有限公司 Method and apparatus for detecting voice signal
US10186254B2 (en) * 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107919130A (en) * 2017-11-06 2018-04-17 百度在线网络技术(北京)有限公司 Method of speech processing and device based on high in the clouds
CN110875059A (en) * 2018-08-31 2020-03-10 深圳市优必选科技有限公司 Method and device for judging reception end and storage device
CN111627423A (en) * 2019-02-27 2020-09-04 百度在线网络技术(北京)有限公司 VAD tail point detection method, device, server and computer readable medium
CN112825248A (en) * 2019-11-19 2021-05-21 阿里巴巴集团控股有限公司 Voice processing method, model training method, interface display method and equipment
CN112863496A (en) * 2019-11-27 2021-05-28 阿里巴巴集团控股有限公司 Voice endpoint detection method and device
CN111583912A (en) * 2020-05-26 2020-08-25 阳光保险集团股份有限公司 Voice endpoint detection method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A voice endpoint detection method based on affinity propagation clustering; Lin Qin; Tu Zhengzheng; Wang Qingwei; Guo Yutang; Journal of Anhui University (Natural Science Edition), Issue 03; full text *
Voice endpoint detection and enhancement methods in a real-time voice acquisition system; Chen Lichun; Audio Engineering, Issue 05; full text *

Also Published As

Publication number Publication date
CN113345473A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN113345473B (en) Voice endpoint detection method, device, electronic equipment and storage medium
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
EP4083998A1 (en) End of query detection
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
CN102982811B (en) Voice endpoint detection method based on real-time decoding
CN111508498B (en) Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
JP2018523156A (en) Language model speech end pointing
CN104766608A (en) Voice control method and voice control device
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN113192535B (en) Voice keyword retrieval method, system and electronic device
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
CN111081219A (en) End-to-end voice intention recognition method
CN114385800A (en) Voice conversation method and device
CN111816216A (en) Voice activity detection method and device
EP2763136B1 (en) Method and system for obtaining relevant information from a voice communication
CN112908301A (en) Voice recognition method, device, storage medium and equipment
Stimberg et al. WaveNetEQ—Packet loss concealment with WaveRNN
CN115346517A (en) Streaming voice recognition method, device, equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN112802498A (en) Voice detection method and device, computer equipment and storage medium
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
CN113948062B (en) Data conversion method and computer storage medium
CN112927680B (en) Voiceprint effective voice recognition method and device based on telephone channel
CN115762500A (en) Voice processing method, device, equipment and storage medium
CN115273862A (en) Voice processing method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230423

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 No. 666, Wangjiang West Road, Hefei High-tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant