CN112053686A - Audio interruption method and device and computer readable storage medium - Google Patents

Audio interruption method and device and computer readable storage medium

Info

Publication number
CN112053686A
CN112053686A (application CN202010739039.0A)
Authority
CN
China
Prior art keywords
audio
data
feature vector
vector data
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010739039.0A
Other languages
Chinese (zh)
Other versions
CN112053686B (English)
Inventor
邢安昊
陈晓宇
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd filed Critical Mobvoi Information Technology Co Ltd
Priority to CN202010739039.0A priority Critical patent/CN112053686B/en
Publication of CN112053686A publication Critical patent/CN112053686A/en
Application granted granted Critical
Publication of CN112053686B publication Critical patent/CN112053686B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/222 Barge in, i.e. overridable guidance for interrupting prompts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an audio interruption method, an audio interruption device, and a computer-readable storage medium. The method comprises the following steps: acquiring a plurality of feature vector data of audio data; generating, from the plurality of feature vector data, a confidence level characterizing the audio data as a specific audio; and stopping the output of the current audio information according to the generated confidence level. Because the decision to stop the current audio output is made from a confidence level computed directly from the feature vector data, no recognition result from a speech recognition decoder is required as in the prior art, so the amount of computation is greatly reduced, the interruption delay is shortened, and the user experience is improved.

Description

Audio interruption method and device and computer readable storage medium
Technical Field
The present invention relates to the field of speech processing, and in particular, to an audio interrupt method and apparatus, and a computer-readable storage medium.
Background
The existing interruption (barge-in) technology is mainly applied to intelligent customer-service dialogue, where the user can interrupt the robot's speech at any time while it is talking. However, the recognition result of the ASR (automatic speech recognition) system arrives with a large delay: nearly 1 s elapses between the moment the user starts speaking and the moment the interruption event is triggered. As a result, the intelligent customer service continues its TTS (text-to-speech) playback for up to 1 s after being interrupted, which degrades the user experience.
Disclosure of Invention
The embodiment of the invention provides an audio interruption method, an audio interruption device and a computer-readable storage medium, which have the technical effects of reducing interruption delay and improving user experience.
One aspect of the present invention provides an audio interruption method, including: acquiring a plurality of feature vector data of the audio data; generating, for a plurality of the feature vector data, a confidence level for characterizing the audio data as a particular audio; stopping the output of the current audio information according to the generated confidence.
In an embodiment, the obtaining the plurality of feature vector data of the audio data includes: extracting a plurality of continuous audio fragment data in the audio data in a streaming manner; and respectively extracting the characteristics of the plurality of audio fragment data to generate a plurality of characteristic vector data.
In one embodiment, the plurality of consecutive audio clip data are extracted at equal time intervals, and adjacent audio clip data partially overlap each other.
In an embodiment, generating, from the plurality of feature vector data, a confidence level characterizing the audio data as a specific audio includes: generating, for each feature vector data, a probability value characterizing the feature vector data as preset classification information; and generating, from the probability value corresponding to each feature vector data, a confidence level characterizing the audio data as the specific audio.
In an embodiment, generating, for each of the feature vector data, a probability value characterizing the feature vector data as preset classification information includes: inputting each feature vector data into a trained classifier model, which outputs, for each input, a probability value characterizing that feature vector data as the preset classification information.
In an embodiment, the classifier model is a two-classifier model, and the preset classification information is human information.
In an embodiment, generating a confidence level characterizing the audio data as a specific audio from the probability value corresponding to each of the feature vector data includes: counting, in a streaming manner, the number of at least some of the probability values that exceed a probability threshold; and if the counted number exceeds a specified number threshold, generating a confidence level characterizing the audio data as the specific audio from the probability values included in the count.
In an embodiment, generating the confidence level from the probability values included in the count comprises: selecting, from the probability values included in the count, those exceeding the probability threshold; and calculating the geometric mean of the selected probability values to generate the confidence level, using the following formula:
Con = (p_1 × p_2 × … × p_M)^(1/M), when M ≥ T_c;
Con = 0, when M < T_c;
where Con represents the confidence level, M represents the number of selected probability values exceeding the probability threshold, p_i represents the probability value characterizing the feature vector data as the preset classification information (only values with p_i > T_p enter the product), T_p represents the probability threshold, and T_c represents the specified number threshold.
Another aspect of the present invention provides an audio interrupting device, comprising: the characteristic acquisition module is used for acquiring a plurality of characteristic vector data of the audio data; a confidence generating module, configured to generate, for a plurality of the feature vector data, a confidence for characterizing the audio data as a specific audio; and the confidence coefficient execution module is used for stopping the output of the current audio information according to the generated confidence coefficient.
Another aspect of the invention provides a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform any of the audio interruption methods described above.
In the embodiment of the invention, the output of the current audio information is determined to stop by utilizing the confidence coefficient generated by the feature vector data, and the recognition result is not required to be obtained by a speech recognition decoder in the prior art, so that the calculation amount is greatly reduced, the interruption delay is reduced, and the user experience is improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Fig. 1 is a schematic flow chart illustrating an implementation of an audio interruption method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a relationship between adjacent audio clip data according to an audio interruption method of the present invention;
fig. 3 is a schematic structural diagram of an audio interrupt device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart illustrating an implementation of an audio interruption method according to an embodiment of the present invention.
As shown in fig. 1, an aspect of the present invention provides an audio interruption method, including:
step 101, acquiring a plurality of feature vector data of audio data;
step 102, generating a confidence coefficient for representing the audio data as a specific audio for the plurality of feature vector data;
step 103, stopping the output of the current audio information according to the generated confidence level.
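The three steps above can be sketched end to end. The following is a minimal, hypothetical illustration only: the function names, window sizes, sample rate, and thresholds are assumptions, and the toy scoring function stands in for the trained classifier described later; none of it is the patent's actual implementation.

```python
import numpy as np

def extract_feature_vectors(audio):
    """Step 101 (sketch): one vector per 25 ms frame, advanced every 10 ms."""
    win, hop = 400, 160                       # samples, assuming 16 kHz audio
    return [audio[i:i + win] for i in range(0, len(audio) - win + 1, hop)]

def voice_probability(vec):
    """Stand-in for a trained binary classifier's human-voice probability."""
    return 1.0 / (1.0 + np.exp(-vec.mean()))  # toy sigmoid score

def confidence(probs, t_p=0.5, t_c=5):
    """Step 102: geometric mean of probabilities above t_p; 0 if fewer than t_c."""
    sel = [p for p in probs if p > t_p]
    return float(np.prod(sel) ** (1.0 / len(sel))) if len(sel) >= t_c else 0.0

audio = np.ones(16000) * 2.0                  # 1 s of dummy "voiced" signal
probs = [voice_probability(v) for v in extract_feature_vectors(audio)]
if confidence(probs) > 0.8:                   # step 103: interrupt playback
    print("interrupt: stop current audio output")
```

The thresholds (0.5, 5, 0.8) are placeholders; in practice they correspond to the probability threshold T_p, number threshold T_c, and confidence cutoff discussed below.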
In this embodiment, in step 101, the audio data may be acquired by an audio acquisition device, such as a voice recorder or a microphone, and the audio data may be a voice of a human being, a voice of an animal, or a natural sound.
In step 102, the specific audio may be one of a human voice, an animal voice, or a natural sound, and may be specified in advance according to the actual application.
In step 103, the confidence level is used to indicate the reliability of the audio data as a specific audio, and the higher the confidence level, the higher the probability that the audio data is a specific audio. The current audio information is mainly output by a machine end or an equipment end, and when the confidence coefficient meets a certain condition, the output of the current audio information is stopped.
Therefore, the output of the current audio information is determined to be stopped by using the confidence coefficient generated by the feature vector data, and a recognition result does not need to be obtained by a speech recognition decoder in the prior art, so that the calculation amount is greatly reduced, the interruption delay is reduced, and the user experience is improved.
When the method is applied to an intelligent customer-service conversation scenario, once the device end judges that the received audio data is human voice, the intelligent customer service immediately stops the current audio output and continues to receive the user's speech.
The method can also be applied to audio output equipment. For example, while an audio output device such as a vehicle-mounted speaker is playing, if a horn (whistle) sound around the vehicle is detected, the current playback is stopped so that the driver can hear the horn, improving driving safety.
In one embodiment, obtaining a plurality of feature vector data of audio data comprises:
extracting a plurality of continuous audio fragment data in the audio data in a streaming manner;
features of the plurality of pieces of audio segment data are extracted, respectively, to generate a plurality of pieces of feature vector data.
In this embodiment, the specific process of step 101 is as follows:
extracting a plurality of continuous audio fragment data from the audio data in an order from a head data node to a tail data node;
then, MFCC (Mel-Frequency Cepstral Coefficient) features or FilterBank features are extracted from each audio fragment data, generating a plurality of feature vector data.
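As one possible illustration of this feature-extraction step, the sketch below computes a log-mel FilterBank vector for a single 25 ms frame in plain NumPy. The sample rate, FFT size, and number of mel filters are assumed values for demonstration; the patent does not specify them.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(frame, sr=16000, n_fft=512, n_mels=26):
    """Log-mel FilterBank feature vector for one windowed frame."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular mel filters spanning 0 Hz .. Nyquist
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return np.log(fbank @ spec + 1e-10)  # small floor avoids log(0)

frame = np.random.default_rng(0).standard_normal(400)  # one 25 ms frame at 16 kHz
vec = log_mel_features(frame)
print(vec.shape)
```

Taking a discrete cosine transform of this vector would yield MFCCs; libraries such as librosa provide both feature types ready-made.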
Fig. 2 is a schematic diagram illustrating a relationship between adjacent audio clip data in an audio interruption method according to an embodiment of the present invention.
In one embodiment, the extraction time intervals of a plurality of consecutive audio clip data are equal, and the data overlap between adjacent audio clip data.
In the present embodiment, as shown in fig. 2, each frame preferably covers one frame time, i.e., 25 ms. To avoid omitting audio data, a new frame is preferably extracted every 10 ms, so that adjacent audio segments overlap each other; the shaded portion in fig. 2 is the overlapping region.
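The framing scheme just described (25 ms frames extracted every 10 ms, so neighboring frames share 15 ms of samples) can be sketched as follows; the 16 kHz sample rate is an assumption:

```python
import numpy as np

def stream_frames(audio, sr=16000, win_ms=25, hop_ms=10):
    """Yield overlapping frames: 25 ms windows advanced every 10 ms,
    so consecutive frames share a 15 ms overlap (the shaded region in fig. 2)."""
    win = int(sr * win_ms / 1000)   # 400 samples per frame
    hop = int(sr * hop_ms / 1000)   # 160 samples between frame starts
    for start in range(0, len(audio) - win + 1, hop):
        yield audio[start:start + win]

audio = np.arange(16000.0)          # 1 second of dummy samples
frames = list(stream_frames(audio))
print(len(frames))                  # 98 frames
```

Because the hop (160 samples) is smaller than the window (400 samples), the last 240 samples of each frame reappear at the start of the next one, which is exactly the overlap the embodiment relies on.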
In an embodiment, for each feature vector data, generating a confidence for characterizing the audio data as a specific audio comprises:
respectively generating probability values for representing the feature vector data as preset classification information aiming at each feature vector data;
and generating a confidence coefficient for representing the audio data as the specific audio according to the probability value corresponding to each feature vector data.
In this embodiment, the specific process of step 102 is:
and judging and generating the probability value of the feature vector data as preset classification information aiming at each feature vector data, wherein the preset classification information can be set according to practical application, for example, when the preset classification information is applied to intelligent customer service conversation, the preset classification information is human voice, and when the preset classification information is applied to vehicle driving, the preset classification information is whistling.
And generating confidence coefficient for representing the audio data as specific audio according to the probability value of each feature vector data.
In an implementation manner, for each piece of feature vector data, respectively generating a probability value for characterizing the feature vector data as preset classification information includes:
and respectively inputting each feature vector data into the classifier model for training, and respectively outputting probability values for representing the feature vector data as preset classification information.
In this embodiment, the specific steps of generating the probability value are as follows:
and respectively outputting each feature vector to a classifier model for training, wherein the classifier model can map the data to one of the given classes so as to be applied to data prediction. In a word, the classifier is a general term of a method for classifying samples in data mining, and includes algorithms such as decision trees, logistic regression, naive bayes, neural networks and the like.
When the classifier is applied to the method, the classifier needs to be trained, and generally the following steps are carried out:
1. samples (including positive samples and negative samples) are selected, and all samples are divided into two parts, namely training samples and testing samples.
2. And executing a classifier algorithm on the training samples to generate a classification model.
3. And executing the classification model on the test sample to generate a prediction result.
4. And calculating necessary evaluation indexes according to the prediction result, and evaluating the performance of the classification model.
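The four training steps above can be illustrated with a toy binary classifier. Everything below is an assumption for demonstration: the synthetic "voice"/"non-voice" features, the logistic-regression model, and the 300/100 train/test split are not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples: "voice" (label 1) vs "non-voice" (label 0) feature vectors
X = np.vstack([rng.normal(1.0, 1.0, (200, 26)),    # positive samples
               rng.normal(-1.0, 1.0, (200, 26))])  # negative samples
y = np.array([1] * 200 + [0] * 200)

# Step 1: divide all samples into training and test sets
idx = rng.permutation(400)
train, test = idx[:300], idx[300:]

# Step 2: run a classifier algorithm (logistic regression via gradient descent)
w, b = np.zeros(26), 0.0
for _ in range(500):
    z = np.clip(X[train] @ w + b, -30, 30)     # clip to keep exp() stable
    p = 1.0 / (1.0 + np.exp(-z))               # sigmoid probabilities
    grad = p - y[train]
    w -= 0.1 * X[train].T @ grad / len(train)
    b -= 0.1 * grad.mean()

# Step 3: execute the model on the test samples to generate predictions
z_test = np.clip(X[test] @ w + b, -30, 30)
p_test = 1.0 / (1.0 + np.exp(-z_test))

# Step 4: compute an evaluation index (accuracy) from the predictions
accuracy = float(((p_test > 0.5) == y[test]).mean())
print(round(accuracy, 2))
```

With real audio one would substitute FilterBank or MFCC vectors for the synthetic data and typically a stronger model, but the train/predict/evaluate loop is the same.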
In one embodiment, the classifier model is a two-classifier model, and the predetermined classification information is human voice information.
In this embodiment, the classifier model is a binary classifier model, preferably used to predict the probability that the feature vector data is human voice information (denoted p_i) and the probability that it is non-human-voice information (denoted q_i), satisfying p_i + q_i = 1.
In an embodiment, generating a confidence level for characterizing the audio data as a specific audio according to the probability value corresponding to each feature vector data includes:
counting in a streaming manner the number of at least partial probability values exceeding a probability threshold;
and if the counted number is judged to exceed the specified number threshold, generating a confidence coefficient for representing the audio data as the specific audio according to the probability value of the participated statistics.
In this embodiment, the specific process of generating the confidence coefficient is as follows:
and (3) selecting the probability values of the designated number in a streaming manner according to the sequence of the data stream by using a Sliding window (Sliding window) technology, and judging the number exceeding the probability threshold value in all the selected probability values, wherein the probability threshold value can be set in advance.
If the number exceeding the probability threshold exceeds a specified number threshold, which can be set in advance, the "steady-state audio" is considered to be detected, and then a confidence coefficient for representing the audio data as the specific audio is generated according to the probability value of the participated statistics.
In an embodiment, generating a confidence level characterizing the audio data as the specific audio from the probability values included in the count comprises:
selecting, from the probability values included in the count, those exceeding the probability threshold;
and calculating the geometric mean of the selected probability values to generate the confidence level, using the following formula:
Con = (p_1 × p_2 × … × p_M)^(1/M), when M ≥ T_c;
Con = 0, when M < T_c;
where Con represents the confidence level, M represents the number of selected probability values exceeding the probability threshold, p_i represents the probability value characterizing the feature vector data as the preset classification information (only values with p_i > T_p enter the product), T_p represents the probability threshold, and T_c represents the specified number threshold.
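The confidence computation just described (count the window's probability values above T_p, then take the geometric mean of those values if the count reaches T_c, else return zero) might be sketched as follows; the threshold values are illustrative:

```python
import numpy as np

def confidence(probs, t_p=0.5, t_c=5):
    """Geometric mean of the probability values in the window that exceed t_p;
    zero if fewer than t_c of them exceed it (no steady-state audio detected)."""
    selected = [p for p in probs if p > t_p]
    M = len(selected)
    if M < t_c:
        return 0.0
    return float(np.prod(selected) ** (1.0 / M))  # M-th root of the product

window = [0.9, 0.8, 0.95, 0.3, 0.85, 0.9, 0.7]    # six values exceed 0.5
print(confidence(window))
```

For long windows, computing `exp(mean(log(selected)))` is numerically safer than taking the product directly, since a product of many probabilities underflows toward zero.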
In this embodiment, when the calculated confidence is higher than the specified value, it is determined that the audio data is the specific audio, and the output of the current audio information is stopped.
Fig. 3 is a schematic structural diagram of an audio interrupt device according to an embodiment of the present invention.
As shown in fig. 3, another aspect of the present invention provides an audio interrupting device, comprising:
a feature obtaining module 201, configured to obtain a plurality of feature vector data of the audio data;
a confidence generating module 202, configured to generate, for the plurality of feature vector data, a confidence for characterizing the audio data as a specific audio;
and the confidence coefficient executing module 203 is used for stopping the output of the current audio information according to the generated confidence coefficient.
In this embodiment, in the feature obtaining module 201, the audio data may be collected through an audio collection device, such as a voice recorder or a microphone, and the audio data may specifically be a human voice, an animal sound, or a natural sound.
In the confidence generating module 202, the specific audio may also be one of human voice, animal voice or natural sound, and may be specified in advance according to the actual application.
In the confidence performing module 203, the confidence is used to indicate the reliability of the audio data as a specific audio, and the higher the confidence is, the higher the probability that the audio data is a specific audio is. The current audio information is mainly output by a machine end or an equipment end, and when the confidence coefficient meets a certain condition, the output of the current audio information is stopped.
Therefore, the output of the current audio information is determined to be stopped by using the confidence coefficient generated by the feature vector data, and a recognition result does not need to be obtained by a speech recognition decoder in the prior art, so that the calculation amount is greatly reduced, the interruption delay is reduced, and the user experience is improved.
When the device is applied to an intelligent customer-service conversation scenario, once the device end judges that the received audio data is human voice, the intelligent customer service immediately stops the current audio output and continues to receive the user's speech.
The device can also be applied to audio output equipment. For example, while an audio output device such as a vehicle-mounted speaker is playing, if a horn (whistle) sound around the vehicle is detected, the current playback is stopped so that the driver can hear the horn, improving driving safety.
In an implementation manner, the feature obtaining module 201 is specifically configured to:
extracting a plurality of continuous audio fragment data in the audio data in a streaming manner;
features of the plurality of pieces of audio segment data are extracted, respectively, to generate a plurality of pieces of feature vector data.
In this embodiment, a plurality of consecutive audio fragment data are extracted from the audio data in the order from the head data node to the tail data node;
then, MFCC (Mel-Frequency Cepstral Coefficient) features or FilterBank features are extracted from each audio fragment data, generating a plurality of feature vector data.
In an implementation, the confidence generation module 202 is specifically configured to:
respectively generating probability values for representing the feature vector data as preset classification information aiming at each feature vector data;
and generating a confidence coefficient for representing the audio data as the specific audio according to the probability value corresponding to each feature vector data.
In this embodiment, for each feature vector data, a probability value that the feature vector data corresponds to preset classification information is generated. The preset classification information can be set according to the practical application: for intelligent customer-service conversation it is human voice, and for vehicle driving it is a horn (whistle) sound.
And generating confidence coefficient for representing the audio data as specific audio according to the probability value of each feature vector data.
The confidence generating module 202 specifically includes the following steps in generating the probability value:
and respectively outputting each feature vector to a classifier model for training, wherein the classifier model can map the data to one of the given classes so as to be applied to data prediction. In a word, the classifier is a general term of a method for classifying samples in data mining, and includes algorithms such as decision trees, logistic regression, naive bayes, neural networks and the like.
The specific process of the confidence generation module 202 in generating the confidence is as follows:
and (3) selecting the probability values of the designated number in a streaming manner according to the sequence of the data stream by using a Sliding window (Sliding window) technology, and judging the number exceeding the probability threshold value in all the selected probability values, wherein the probability threshold value can be set in advance.
If the number exceeding the probability threshold exceeds the specified number threshold, considering that the steady-state audio is detected, and selecting the probability value exceeding the probability threshold from the probability values of the participated statistics; and calculating the geometric mean value of the selected probability value to generate confidence coefficient, wherein the calculation formula is as follows:
Con = (p_1 × p_2 × … × p_M)^(1/M), when M ≥ T_c;
Con = 0, when M < T_c;
where Con represents the confidence level, M represents the number of selected probability values exceeding the probability threshold, p_i represents the probability value characterizing the feature vector data as the preset classification information (only values with p_i > T_p enter the product), T_p represents the probability threshold, and T_c represents the specified number threshold.
When the confidence coefficient execution module 203 determines that the calculated confidence coefficient is higher than the specified value, it determines that the audio data is a specific audio, and stops outputting the current audio information.
Another aspect of the invention provides a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform an audio interrupt method.
In an embodiment of the invention, a computer-readable storage medium includes a set of computer-executable instructions that, when executed, obtain a plurality of feature vector data for audio data; generating, for a plurality of feature vector data, a confidence level for characterizing the audio data as a particular audio; stopping the output of the current audio information according to the generated confidence.
Therefore, the output of the current audio information is determined to be stopped by using the confidence coefficient generated by the feature vector data, and a recognition result does not need to be obtained by a speech recognition decoder in the prior art, so that the calculation amount is greatly reduced, the interruption delay is reduced, and the user experience is improved.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An audio interruption method, the method comprising:
acquiring a plurality of feature vector data of the audio data;
generating, for a plurality of the feature vector data, a confidence level for characterizing the audio data as a particular audio;
stopping the output of the current audio information according to the generated confidence.
2. The method of claim 1, wherein obtaining the plurality of feature vector data for the audio data comprises:
extracting a plurality of continuous audio fragment data in the audio data in a streaming manner;
and respectively extracting the characteristics of the plurality of audio fragment data to generate a plurality of characteristic vector data.
3. The method according to claim 2, wherein the plurality of consecutive audio clip data are extracted at equal time intervals, and the adjacent audio clip data are overlapped with each other by a part of data.
4. The method of claim 1, wherein the generating, for a plurality of the feature vector data, a confidence level for characterizing the audio data as a particular audio comprises:
respectively generating a probability value for representing the feature vector data as preset classification information aiming at each feature vector data;
and generating a confidence coefficient for representing the audio data as specific audio according to the probability value corresponding to each feature vector data.
5. The method of claim 4, wherein generating, for each of the feature vector data, a probability value characterizing the feature vector data as preset classification information comprises:
inputting each of the feature vector data into a trained classifier model, which outputs, for each input, a probability value characterizing the feature vector data as the preset classification information.
6. The method of claim 5, wherein the classifier model is a binary classifier model and the preset classification information is human-voice information.
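Claims 5 and 6 leave the classifier unspecified beyond being a trained binary model. As an illustrative stand-in (the weights, bias, and feature dimension below are invented for this example, not taken from the patent), a logistic scorer maps one feature vector to a human-voice probability:

```python
import math
import numpy as np

# Hypothetical parameters of an already-trained binary (voice / non-voice)
# classifier; a real system would load a trained model instead.
W = np.array([0.8, -0.5, 1.2])
B = -0.1

def voice_probability(feature_vec: np.ndarray) -> float:
    """Return the probability that one feature vector represents human
    voice, i.e. the preset classification information of claim 6."""
    z = float(feature_vec @ W + B)       # linear score
    return 1.0 / (1.0 + math.exp(-z))    # logistic squashing to (0, 1)

p = voice_probability(np.array([1.0, 0.0, 1.0]))
```

Each feature vector is scored independently, which matches the per-vector probability values of claim 5.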
7. The method of claim 4, wherein generating a confidence level characterizing the audio data as the specific audio according to the probability value corresponding to each of the feature vector data comprises:
counting, in a streaming manner, the number of probability values, among at least some of the probability values, that exceed a probability threshold; and
if the counted number exceeds a specified number threshold, generating, according to the probability values involved in the counting, a confidence level characterizing the audio data as the specific audio.
8. The method of claim 7, wherein generating a confidence level characterizing the audio data as the specific audio according to the probability values involved in the counting comprises:
selecting, from the probability values involved in the counting, the probability values exceeding the probability threshold; and
calculating the geometric mean of the selected probability values to generate the confidence level, the calculation formula being:
Con = (p_1 × p_2 × … × p_M)^(1/M), the product taken over the M probability values p_i > T_p, with M > T_c,
where Con represents the confidence level, M represents the number of probability values that exceed the probability threshold, p_i represents the probability value characterizing the feature vector data as preset classification information, T_p represents the probability threshold, and T_c represents the specified number threshold.
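Under the notation of claim 8, the count-then-geometric-mean computation can be sketched as follows; the function and variable names are ours, and computing the mean in log space is an implementation choice for numerical stability, not part of the claim:

```python
import math

def confidence(probs, t_p, t_c):
    """Claims 7-8: keep the probability values exceeding the probability
    threshold t_p; once their number M exceeds the specified number
    threshold t_c, return the geometric mean Con = (p_1 * ... * p_M)^(1/M).
    Returns None while the evidence is still insufficient."""
    selected = [p for p in probs if p > t_p]
    m = len(selected)
    if m <= t_c:
        return None
    # geometric mean computed in log space to avoid underflow on long runs
    return math.exp(sum(math.log(p) for p in selected) / m)

con = confidence([0.9, 0.4, 0.8, 0.95], t_p=0.5, t_c=2)
```

With the sample inputs, three of the four probabilities exceed 0.5, so the result is the geometric mean of 0.9, 0.8, and 0.95.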
9. An audio interruption device, the device comprising:
a feature acquisition module, configured to acquire a plurality of feature vector data of audio data;
a confidence generation module, configured to generate, from the plurality of feature vector data, a confidence level characterizing the audio data as a specific audio; and
a confidence execution module, configured to stop output of current audio information according to the generated confidence level.
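The three modules of claim 9 map naturally onto a small composition object. A sketch under assumed interfaces: the callables, the class name, and the interruption threshold are illustrative, not taken from the patent:

```python
class AudioInterruptionDevice:
    """Illustrative composition of the three modules in claim 9. The
    callables stand in for the feature acquisition, confidence
    generation, and playback-stopping logic."""

    def __init__(self, acquire_features, generate_confidence, stop_playback,
                 confidence_threshold=0.5):
        self.acquire_features = acquire_features        # feature acquisition module
        self.generate_confidence = generate_confidence  # confidence generation module
        self.stop_playback = stop_playback              # confidence execution module
        self.confidence_threshold = confidence_threshold

    def process(self, audio_data) -> bool:
        """Run one pass over incoming audio; stop current playback and
        return True when the confidence warrants an interruption."""
        feature_vectors = self.acquire_features(audio_data)
        con = self.generate_confidence(feature_vectors)
        if con is not None and con > self.confidence_threshold:
            self.stop_playback()
            return True
        return False
```

Keeping the three responsibilities behind separate callables mirrors the module boundaries of the claim and makes each piece testable in isolation.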
10. A computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform the audio interruption method of any one of claims 1 to 8.
CN202010739039.0A 2020-07-28 2020-07-28 Audio interruption method, device and computer readable storage medium Active CN112053686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010739039.0A CN112053686B (en) 2020-07-28 2020-07-28 Audio interruption method, device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN112053686A 2020-12-08
CN112053686B 2024-01-02

Family

ID=73602486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010739039.0A Active CN112053686B (en) 2020-07-28 2020-07-28 Audio interruption method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112053686B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700775A (en) * 2020-12-29 2021-04-23 维沃移动通信有限公司 Method and device for updating voice receiving period and electronic equipment
CN113257242A (en) * 2021-04-06 2021-08-13 杭州远传新业科技有限公司 Voice broadcast suspension method, device, equipment and medium in self-service voice service

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004560A (en) * 2010-12-01 2011-04-06 哈尔滨工业大学 User character recognition method and online once learning method in statement-level Chinese character input method and machine learning system
US20150356461A1 (en) * 2014-06-06 2015-12-10 Google Inc. Training distilled machine learning models
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN110827798A (en) * 2019-11-12 2020-02-21 广州欢聊网络科技有限公司 Audio signal processing method and device



Also Published As

Publication number Publication date
CN112053686B (en) 2024-01-02

Similar Documents

Publication Publication Date Title
CN107928673B (en) Audio signal processing method, audio signal processing apparatus, storage medium, and computer device
US8793127B2 (en) Method and apparatus for automatically determining speaker characteristics for speech-directed advertising or other enhancement of speech-controlled devices or services
US8838452B2 (en) Effective audio segmentation and classification
US20130054236A1 (en) Method for the detection of speech segments
US9311930B2 (en) Audio based system and method for in-vehicle context classification
JP3913772B2 (en) Sound identification device
EP2560167A2 (en) Methods and apparatus for performing song detection in audio signal
CN110852215A (en) Multi-mode emotion recognition method and system and storage medium
Socoró et al. Development of an Anomalous Noise Event Detection Algorithm for dynamic road traffic noise mapping
Valero et al. Hierarchical classification of environmental noise sources considering the acoustic signature of vehicle pass-bys
CN112053686B (en) Audio interruption method, device and computer readable storage medium
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN110299150A (en) A kind of real-time voice speaker separation method and system
Kiktova et al. Comparison of different feature types for acoustic event detection system
Jiang et al. Video segmentation with the assistance of audio content analysis
CN112466287A (en) Voice segmentation method and device and computer readable storage medium
KR102066718B1 (en) Acoustic Tunnel Accident Detection System
JP5105097B2 (en) Speech classification apparatus, speech classification method and program
CN112489692A (en) Voice endpoint detection method and device
CN113963719A (en) Deep learning-based sound classification method and apparatus, storage medium, and computer
CN114038487A (en) Audio extraction method, device, equipment and readable storage medium
EP3309777A1 (en) Device and method for audio frame processing
CN112992175B (en) Voice distinguishing method and voice recording device thereof
CN114329042A (en) Data processing method, device, equipment, storage medium and computer program product
EP3847646B1 (en) An audio processing apparatus and method for audio scene classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant