CN116741160A - Streaming voice recognition method and device and electronic equipment


Info

Publication number
CN116741160A
Authority
CN
China
Prior art keywords
voice
stream
speech
frame
diffusion model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310627778.4A
Other languages
Chinese (zh)
Inventor
李强
王志铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310627778.4A
Publication of CN116741160A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0635 - Training updating or merging of old and new templates; Mean values; Weighting
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiments of this specification disclose an end-to-end streaming voice recognition method, comprising the following steps: acquiring a voice stream to be recognized; inputting the voice stream to be recognized into a pre-trained diffusion model, and predicting at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model; splicing the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream; and performing voice recognition on the spliced voice stream to obtain a voice recognition result. Correspondingly, an end-to-end streaming voice recognition device is also disclosed.

Description

Streaming voice recognition method and device and electronic equipment
Technical Field
The present invention relates to the field of speech processing technologies, and in particular to a streaming voice recognition method and device, and an electronic device.
Background
In streaming voice recognition, a small amount of future speech information is typically added to enrich the contextual information available during recognition. This future information is usually taken from the real underlying audio, which makes the recognition system output words later than the user actually speaks them.
Disclosure of Invention
One purpose of the invention is to provide an end-to-end streaming voice recognition method that increases semantic context by predicting future voice frames in place of real future audio, thereby reducing the delay in the recognition process and improving the accuracy of real-time voice recognition.
According to the above object, an embodiment of the present disclosure provides an end-to-end streaming voice recognition method, including:
acquiring a voice stream to be recognized;
inputting the voice stream to be recognized into a pre-trained diffusion model, and predicting at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
splicing the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
and performing voice recognition on the spliced voice stream to obtain a voice recognition result.
The embodiments of this specification provide a new end-to-end streaming voice recognition method: future voice frames of the voice stream to be recognized are predicted by a pre-trained diffusion model, the predicted future voice frames are spliced with the voice stream to be recognized, and voice recognition is performed on the resulting spliced voice stream. Using predicted future voice frames in place of real ones adds semantic context while improving the accuracy and real-time performance of recognition, and mitigates the problem that the recognition results output by a voice recognition system are delayed.
Further, in some embodiments, the diffusion model is pre-trained as follows:
acquiring a target voice frame;
determining data distribution in the target voice frame, and taking the data distribution in the target voice frame as a training label;
successively and randomly adding Gaussian noise to the target voice frame for a preset number of times to obtain a sample voice stream;
inputting the sample voice stream into a preset diffusion model to obtain a predicted voice frame;
and updating parameters of the neural network in the diffusion model based on the predicted voice frame and the target voice frame.
Still further, in some embodiments, the updating parameters of the neural network in the diffusion model based on the predicted speech frame and the target speech frame specifically includes:
and taking the difference between the data distribution of the target voice frame and that of the predicted voice frame as the prediction loss function of the diffusion model, training the diffusion model with minimization of this loss as the objective, and updating the parameters of the neural network in the diffusion model.
Further, in some embodiments, before inputting the voice stream to be recognized into the pre-trained diffusion model, the method further includes:
segmenting the voice stream to be recognized according to a preset time window to obtain at least one voice stream segment;
and inputting the voice stream segment into the diffusion model to obtain a future voice frame corresponding to that segment.
Further, in some embodiments, a ratio of the future speech frames to the received speech stream in the spliced speech stream is greater than a preset ratio threshold.
Another purpose of the present invention is to provide an end-to-end streaming voice recognition device, which improves the accuracy of voice recognition by predicting future voice frames and mitigates the delay in the recognition results.
According to the above object, an embodiment of the present disclosure provides an end-to-end streaming voice recognition apparatus, including:
a receiving module configured to receive a voice stream to be recognized;
a future speech prediction module configured to input the voice stream to be recognized into a pre-trained diffusion model, and predict at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
a splicing module configured to splice the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
and the voice recognition module is configured to perform voice recognition on the spliced voice stream to obtain a voice recognition result.
Further, in some embodiments, the future speech prediction module is further configured to segment the voice stream to be recognized by a preset time window to obtain at least one voice stream segment, and to input each segment into the diffusion model to obtain the future voice frame corresponding to that segment.
Further, in some embodiments, the splicing module is specifically configured to splice the future voice frame and the voice stream to be recognized according to a preset ratio to obtain the spliced voice stream; in the spliced voice stream, the ratio of the future voice frame to the received voice stream is greater than a preset ratio threshold.
The embodiments of the present specification also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described end-to-end streaming voice recognition method.
The embodiment of the specification also provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform the end-to-end streaming voice recognition method described above.
The end-to-end streaming voice recognition method disclosed in the embodiments of this specification has the following beneficial effects: future voice frames of the voice stream to be recognized are predicted from Gaussian noise by the pre-trained diffusion model, and voice recognition is performed on the spliced voice stream obtained by joining the predicted future voice frames with the voice stream to be recognized. Replacing real future voice frames with predicted ones adds semantic context while improving the accuracy and real-time performance of recognition, and mitigates the delay in the recognition results output by the voice recognition system. In addition, the ratio of future voice frames to the received voice stream is controlled so that it is not too small, which safeguards the accuracy of voice recognition.
The end-to-end streaming voice recognition device described in the embodiments of this specification has the same beneficial effects.
Drawings
Fig. 1 schematically illustrates a flow diagram of an end-to-end streaming voice recognition method according to an embodiment of this specification in one implementation.
Fig. 2 schematically shows a flowchart of the diffusion model in the end-to-end streaming voice recognition method according to an embodiment of this specification, as executed in one implementation.
Fig. 3 schematically illustrates an application of the end-to-end streaming voice recognition method according to an embodiment of this specification in a specific implementation.
Fig. 4 schematically illustrates the structure of an end-to-end streaming voice recognition device according to an embodiment of this specification in one implementation.
Detailed Description
It is first noted that the terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, the embodiments and features in the embodiments in the present specification may be combined with each other without conflict.
The end-to-end streaming voice recognition method and device described in the embodiments of this specification are further described in detail below with reference to the accompanying drawings and specific embodiments, but the detailed description is not to be construed as limiting the embodiments of this specification.
Voice recognition is widely used in daily life: converting voice messages into text, voice input that can replace keyboard input, voice-controlled smart home appliances, and so on all make life more convenient. Conveying information by voice fits people's daily habits and is more natural and efficient. Compared with ordinary voice recognition, streaming voice recognition demands stronger real-time performance and is typically used in dialogue scenarios such as intelligent customer service and interactive risk control. Because voice must be converted into a text result in real time, the contextual voice information used during recognition can be supplemented by acquiring future voice frames, which safeguards recognition accuracy; but using real future voice frames inevitably makes the moment the recognition result is output later than the moment the user actually speaks, and the resulting delay fails the requirement of high real-time performance. Conversely, if streaming recognition is forced into a low-latency regime, the accuracy of the recognition result becomes unsatisfactory.
An end-to-end voice recognition system is one built from a neural network model; such systems are highly portable and reusable and can realize practical streaming voice recognition at relatively low cost. To achieve both low latency and high accuracy in streaming voice recognition, the embodiments of this specification implement end-to-end streaming voice recognition based on a diffusion model, which predicts future voice frames to replace real future voice frames when supplementing contextual voice information.
In one embodiment of this specification, an end-to-end streaming voice recognition method is presented. Fig. 1 schematically illustrates a flow diagram of the method in one implementation.
As shown in Fig. 1, the method includes:
100: Acquiring the voice stream to be recognized.
A voice stream to be recognized can be collected through a microphone; it should contain recognizable voice content, and its duration should not be too short, so as to ensure a good recognition effect.
102: inputting the voice stream to be recognized into a pre-trained diffusion model, and predicting at least one future voice frame of the voice stream to be recognized through a back diffusion process of the diffusion model.
The diffusion model takes its inspiration from non-equilibrium thermodynamics. By defining a Markov chain of diffusion steps, the forward diffusion process gradually adds random noise to the input original data a preset number of times and outputs noise data; the reverse diffusion process, which may also be called the reverse generation process, is then learned and trained so that data samples approximating the real original data can be constructed from the noise data.
In some embodiments, the diffusion model is pre-trained as follows:
acquiring a target voice frame;
determining data distribution in the target voice frame, and taking the data distribution in the target voice frame as a training label;
successively and randomly adding Gaussian noise to the target voice frame a preset number of times to obtain a sample voice stream;
Inputting the sample voice stream into a preset diffusion model to obtain a predicted voice frame;
based on the predicted speech frame and the target speech frame, parameters of the neural network in the diffusion model are updated.
During training of the diffusion model, the data distribution of the target voice frame serves as the original voice data input to the model. Forward diffusion randomly adds Gaussian noise to the target voice frame a preset number of times; the data distribution of the resulting sample voice stream is the diffused noise data, which simulates the voice stream that temporally precedes the target voice frame.
The sample voice stream is then input into the diffusion model for reverse generation, which predicts a future voice frame from the noise data of the sample voice stream, i.e., constructs a predicted voice frame approximating the target voice frame, treated as the simulated original voice data recovered from the noise data. Because the data distribution within a piece of audio is continuous, the original target voice frame cannot be obtained exactly by reverse generation; it can only be approximated from the sample voice stream.
Training the diffusion model focuses on the reverse generation process: the parameters of the neural network in the model are updated according to the difference between the predicted voice frame and the target voice frame. After several iterations of training, a trained reverse generation function, and hence a trained diffusion model, is obtained.
In some more specific embodiments, updating parameters of the neural network in the diffusion model based on the predicted speech frame and the target speech frame specifically includes:
taking the difference between the data distribution of the target voice frame and that of the predicted voice frame as the prediction loss function of the diffusion model, training the model with minimization of this loss as the objective, and updating the parameters of the neural network in the diffusion model.
In other words, the forward diffusion process simulates the received voice data preceding the target voice frame to obtain the sample voice stream, and the reverse generation process predicts the future voice data following the sample voice stream as an approximation of the target voice frame. The diffusion model is iteratively trained to minimize the data-distribution difference between the predicted voice frame and the target voice frame, updating the parameters of the model until that difference is smaller than a preset threshold, at which point the trained diffusion model is obtained; a minimal training sketch follows.
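The following is a minimal PyTorch sketch of this training loop. It is an illustration only: the patent specifies neither the network architecture, the noise schedule, nor the exact distance between data distributions, so the small MLP, the constant betas, and the MSE loss below are all assumptions.

```python
import torch
import torch.nn as nn

FRAME_DIM = 160                                   # assumed feature size per frame
model = nn.Sequential(                            # stand-in reverse-generation network
    nn.Linear(FRAME_DIM, 256), nn.ReLU(), nn.Linear(256, FRAME_DIM))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
betas = [0.01] * 100                              # assumed noise schedule, N = 100 steps

def train_step(target_frame: torch.Tensor) -> float:
    """One iteration: forward-diffuse the target frame into a sample stream,
    predict the frame back (reverse generation), and minimize the gap."""
    x = target_frame
    for beta in betas:                            # successively add Gaussian noise
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
    predicted_frame = model(x)                    # approximate the target frame
    loss = nn.functional.mse_loss(predicted_frame, target_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(FRAME_DIM))         # e.g., one step on a dummy frame
```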
In some embodiments, before inputting the voice stream to be recognized into the pre-trained diffusion model, the method further includes:
segmenting the voice stream to be recognized according to a preset time window to obtain at least one voice stream segment;
and inputting the voice stream segment into the diffusion model to obtain a future voice frame corresponding to that segment.
The duration of the voice stream to be recognized varies with the scenario, because the voice content expressing the user's demand varies: with voice input, a user may speak a fairly long passage; a command given to a smart home appliance is usually short; and the voice messages exchanged by the two parties in an intelligent customer-service dialogue may be long or short. For a voice stream shorter than the preset time window, the segmentation result is simply the complete voice stream to be recognized. For a longer stream, segmenting by the preset time window lets a corresponding future voice frame be generated for each segment (see the sketch below), which reduces the delay in producing voice recognition results to a much greater extent, and thus gives stronger real-time performance, than generating future voice frames once for the entire long stream.
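A sketch of the segmentation step, with assumed names and parameters (the text fixes no window length; the 640 ms default echoes the embodiment described later):

```python
def segment_stream(samples: list, sample_rate: int, window_ms: int = 640) -> list:
    """Cut a voice stream into segments of at most window_ms each; a stream
    shorter than one window comes back unchanged as a single segment."""
    window = sample_rate * window_ms // 1000
    if len(samples) <= window:
        return [samples]
    return [samples[i:i + window] for i in range(0, len(samples), window)]
```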
104: and splicing the acquired at least one future voice frame with the voice stream to be recognized to obtain a spliced voice stream.
Once the voice stream to be recognized has been acquired, it is treated as a successfully received voice stream awaiting further processing; it is then input into the diffusion model, whose reverse diffusion process predicts the corresponding future voice frames. The number of future voice frames must be greater than or equal to one so that they can be spliced with the received voice stream, after which the spliced voice stream is recognized.
In some embodiments, the ratio of the future speech frames to the received speech stream in the spliced speech stream is greater than a preset ratio threshold.
To ensure that the voice recognition system can recognize the spliced voice stream effectively and in real time, the future voice frames predicted by the diffusion model must carry useful voice information and actually enrich the semantic context of the voice information. The ratio of future voice frames to the received voice stream is therefore controlled so that the share of future frames is not too small, which improves the accuracy of voice recognition; a minimal check is sketched below.
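A minimal sketch of the splice with the ratio check, assuming list-like audio buffers and using the 1:8 floor from the embodiment described later as the preset threshold:

```python
def splice(received: list, future: list, min_ratio: float = 1 / 8) -> list:
    """Append predicted future frames to the received stream, rejecting
    splices whose future share is too small to add useful context."""
    if len(future) / len(received) < min_ratio:
        raise ValueError("future-frame ratio below the preset threshold")
    return received + future
```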
106: and performing voice recognition on the spliced voice stream to obtain a voice recognition result.
In some embodiments, the voice recognition result may be obtained from the spliced voice stream using an existing speech-to-text algorithm.
Fig. 2 schematically shows a flowchart of the diffusion model in the end-to-end streaming voice recognition method, as executed in one implementation.
As shown in Fig. 2, in some embodiments, the input original voice data in the forward diffusion process of the diffusion model is denoted x_0 and follows the distribution q; that is, Gaussian noise can be added to voice data through the q() function. For example, applying q(x_1 | x_0) to the original voice data x_0 adds Gaussian noise once and yields the once-noised voice data x_1. With the number of noising steps set to N, the q() operation is performed N times, finally applying q(x_N | x_{N-1}) to obtain the noised voice data x_N. This noised voice data can serve as the sample voice stream, simulating the voice stream that temporally precedes the original voice data.
The q () function can be obtained by:
wherein x represents voice data; n represents the number of times noise is added to the voice data; beta n Is gaussian noise conforming to normal distribution;is a noise function; i represents the range of the noise function.
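Read as code, one q() step looks like the following (a PyTorch sketch; the function name and the constant schedule are assumptions, since the patent shows no code):

```python
import math
import torch

def q_step(x_prev: torch.Tensor, beta_n: float) -> torch.Tensor:
    """One forward-diffusion step: sample
    x_n ~ N(x_n; sqrt(1 - beta_n) * x_{n-1}, beta_n * I)."""
    return math.sqrt(1.0 - beta_n) * x_prev + math.sqrt(beta_n) * torch.randn_like(x_prev)

# Applying the step N times turns original voice data x_0 into the noised x_N:
x = torch.randn(160)                  # stand-in for one frame of voice features
for beta in [0.01] * 100:             # assumed constant schedule, N = 100
    x = q_step(x, beta)
```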
In the reverse generation process of the diffusion model, the noised voice data x_N produced by forward diffusion is taken as the model input, from which the original voice data x_0 is gradually recovered, i.e., predicted. The distribution that the noised voice data x_N conforms to is denoted p; the original voice data can then be restored step by step with the p() function. For example, p(x_{N-1} | x_N) yields the intermediate data x_{N-1} from the noised voice data x_N, and after N such steps, p(x_0 | x_1) finally yields the recovered voice data x_0'.
The p distribution needs to be predicted by a deep learning model, which is set as:

$$p_\theta(x_{n-1} \mid x_n) = \mathcal{N}\!\left(x_{n-1};\ \mu_\theta(x_n, n),\ \Sigma_\theta(x_n, n)\right)$$

wherein μ_θ(x_n, n) and Σ_θ(x_n, n) are the model quantities to be trained, which can be represented uniformly as Z_θ; θ denotes the model parameters.
During training of the reverse generation process of the diffusion model, the prediction loss function is determined by the distribution difference between the original voice data x_0 and the voice data x_0' recovered from the noised voice data. The diffusion model is trained for several iterations with the aim of minimizing this prediction loss, updating the parameters Z_θ in the model until the distribution difference between the original voice data x_0 and the recovered voice data x_0' is smaller than a preset threshold, at which point the trained diffusion model is obtained.
When voice data with added random noise needs a prediction, it is input into the trained diffusion model for reverse generation, and the corresponding future voice frame is obtained by stepping through the p() function, as sketched below.
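The stepwise p() computation can be read as the following loop (a sketch; mu_theta and sigma_theta stand for the trained model quantities, and their call signature is an assumption):

```python
import torch

@torch.no_grad()
def reverse_generate(x_N: torch.Tensor, mu_theta, sigma_theta, N: int) -> torch.Tensor:
    """Iterate p(x_{n-1} | x_n) = N(mu_theta(x_n, n), sigma_theta(x_n, n))
    for n = N..1, returning the recovered voice data x_0'."""
    x = x_N
    for n in range(N, 0, -1):
        # no fresh noise at the final step (common diffusion-sampling convention)
        noise = torch.randn_like(x) if n > 1 else torch.zeros_like(x)
        x = mu_theta(x, n) + sigma_theta(x, n) * noise
    return x
```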
Fig. 3 schematically illustrates an application of the end-to-end streaming voice recognition method in a specific implementation.
As shown in Fig. 3, in a preferred embodiment, a voice stream of 640 ms duration is first acquired as the voice stream to be recognized and is received by the voice recognition system to await further processing. The voice stream to be recognized is input into the pre-trained diffusion model, and at least one future voice frame corresponding to the stream is predicted through the reverse diffusion process of the model. To ensure the accuracy of voice recognition, this embodiment sets the optimal ratio of future voice frames to the received voice stream to 1:4, so the predicted future voice frames total 160 ms; preferably, the ratio may be set to greater than or equal to 1:8. The future voice frames are spliced after the received voice stream to be recognized, giving a spliced voice stream of 800 ms duration, and voice recognition is then performed on the spliced stream to obtain the recognition result.
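The arithmetic of this embodiment, as a self-contained sketch (the zero arrays stand in for real audio and model output, and the 16 kHz sampling rate is an assumption; the text gives none):

```python
import numpy as np

SAMPLE_RATE = 16000                               # assumed; not specified in the text

def ms_to_samples(ms: int) -> int:
    return SAMPLE_RATE * ms // 1000

received = np.zeros(ms_to_samples(640))           # 640 ms received voice stream
future = np.zeros(ms_to_samples(160))             # 160 ms predicted future frames
assert len(future) / len(received) == 1 / 4       # optimal ratio 1:4, above the 1:8 floor
spliced = np.concatenate([received, future])      # 800 ms spliced voice stream
# `spliced` would then go to an existing speech-to-text backend for recognition.
```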
According to the end-to-end streaming voice recognition method provided by the embodiments of this specification, future voice frames of the voice stream to be recognized are predicted from Gaussian noise by the pre-trained diffusion model, and voice recognition is performed on the spliced voice stream obtained by joining the predicted future voice frames with the voice stream to be recognized. Replacing real future voice frames with predicted ones adds semantic context while improving the accuracy and real-time performance of voice recognition, and mitigates the delay in the recognition results output by the voice recognition system. In addition, the ratio of future voice frames to the received voice stream is controlled so that it is not too small, which ensures the accuracy of voice recognition.
In another embodiment of the present specification, an end-to-end streaming voice recognition apparatus is presented.
Fig. 4 schematically illustrates the structure of an end-to-end streaming voice recognition device according to an embodiment of this specification in one implementation.
As shown in Fig. 4, the device includes:
a receiving module 20 configured to receive a voice stream to be recognized;
a future speech prediction module 22 configured to input the voice stream to be recognized into a pre-trained diffusion model, and predict at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
a splicing module 24 configured to splice the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
the voice recognition module 26 is configured to perform voice recognition on the spliced voice stream to obtain a voice recognition result.
The receiving module can receive a voice stream to be recognized collected by a microphone; the stream should contain recognizable voice content, and its duration should not be too short, so as to ensure a good recognition effect.
By defining a Markov chain of diffusion steps, the diffusion model in the future speech prediction module gradually adds random noise to the input original data a preset number of times in the forward diffusion process and outputs noise data; the reverse diffusion process, which may also be called the reverse generation process, is then learned and trained so that data samples approximating the real original data can be constructed from the noise data.
In some embodiments, the future speech prediction module further includes a training module, where the training module is configured to obtain a target speech frame; determining data distribution in the target voice frame, and taking the data distribution in the target voice frame as a training label; successively and randomly adding Gaussian noise to a target voice frame for a preset number of times to obtain a sample voice stream; inputting the sample voice stream into a preset diffusion model to obtain a predicted voice frame; based on the predicted speech frame and the target speech frame, parameters of the neural network in the diffusion model are updated.
The training module trains the diffusion model: the data distribution of the target voice frame serves as the original voice data input to the model, Gaussian noise is randomly added to the target voice frame a preset number of times through forward diffusion, and the data distribution of the resulting sample voice stream is the diffused noise data, which simulates the voice stream that temporally precedes the target voice frame.
The sample voice stream is then input into the diffusion model for reverse generation, which predicts a future voice frame from the noise data of the sample voice stream, i.e., constructs a predicted voice frame approximating the target voice frame, treated as the simulated original voice data recovered from the noise data. Because the data distribution within a piece of audio is continuous, the original target voice frame cannot be obtained exactly by reverse generation; it can only be approximated from the sample voice stream.
The key point of the training module is to train the reverse generation process of the diffusion model, updating the parameters of the neural network according to the difference between the predicted voice frame and the target voice frame; the parameters to be updated include those of the reverse generation function that predicts future voice frames from the noise data of the sample voice stream. After several iterations of training, a trained reverse generation function, and hence a trained diffusion model, is obtained.
In some more specific embodiments, the training module is configured to take the difference between the data distribution of the target voice frame and that of the predicted voice frame as the prediction loss function of the diffusion model, to train the diffusion model with minimization of this loss as the objective, and to update the parameters of the neural network in the diffusion model.
In other words, the forward diffusion process simulates the received voice data preceding the target voice frame to obtain the sample voice stream, and the reverse generation process predicts the future voice data following the sample voice stream as an approximation of the target voice frame. The training module iteratively trains the diffusion model to minimize the data-distribution difference between the predicted voice frame and the target voice frame, updating the parameters of the model until that difference is smaller than a preset threshold, yielding the trained diffusion model.
In some embodiments, the future speech prediction module is further configured to segment the voice stream to be recognized by a preset time window to obtain at least one voice stream segment, and to input each segment into the diffusion model to obtain the future voice frame corresponding to that segment.
The duration of the voice stream to be recognized varies with the scenario, because the voice content expressing the user's demand varies: with voice input, a user may speak a fairly long passage; a command given to a smart home appliance is usually short; and the voice messages exchanged by the two parties in an intelligent customer-service dialogue may be long or short. For a voice stream shorter than the preset time window, the future speech prediction module's segmentation result is simply the complete voice stream to be recognized; for a longer stream, segmenting by the preset time window lets the module generate a corresponding future voice frame for each segment, which reduces the delay in producing recognition results to a much greater extent, and thus gives stronger real-time performance, than generating a single future voice frame for the entire long stream.
Once the voice stream to be recognized has been acquired, it is treated as a successfully received voice stream awaiting further processing; it is then input into the diffusion model, whose reverse diffusion process predicts the corresponding future voice frames. The number of future voice frames must be greater than or equal to one so that the splicing module can join them to the received voice stream, after which the spliced voice stream is recognized.
In some embodiments, the splicing module is specifically configured to splice the future voice frame and the voice stream to be recognized according to a preset ratio to obtain the spliced voice stream; in the spliced voice stream, the ratio of the future voice frame to the received voice stream is greater than a preset ratio threshold.
To ensure that the voice recognition system can recognize the spliced voice stream effectively and in real time, the future voice frames predicted by the diffusion model must carry useful voice information and actually enrich the semantic context of the voice information; the splicing module therefore controls the proportion between the future voice frames and the received voice stream so that the share of future frames is not too small, which improves the accuracy of voice recognition.
In some embodiments, the voice recognition module may obtain the voice recognition result from the spliced voice stream using an existing speech-to-text algorithm.
One embodiment of the present specification also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor implements the end-to-end streaming voice recognition method described above.
One embodiment in the present specification also provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform the end-to-end streaming voice recognition method described above.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
It should be noted that the above-mentioned embodiments are merely examples of the present invention, and it is obvious that the present invention is not limited to the above-mentioned embodiments, and many similar variations are possible. All modifications attainable or obvious from the present disclosure set forth herein should be deemed to be within the scope of the present disclosure.

Claims (10)

1. An end-to-end streaming voice recognition method comprising:
acquiring a voice stream to be recognized;
inputting the voice stream to be recognized into a pre-trained diffusion model, and predicting at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
splicing the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
and performing voice recognition on the spliced voice stream to obtain a voice recognition result.
2. The method of claim 1, wherein the diffusion model is pre-trained by:
acquiring a target voice frame;
determining data distribution in the target voice frame, and taking the data distribution in the target voice frame as a training label;
successively and randomly adding Gaussian noise to the target voice frame for a preset number of times to obtain a sample voice stream;
inputting the sample voice stream into a preset diffusion model to obtain a predicted voice frame;
and updating parameters of the neural network in the diffusion model based on the predicted voice frame and the target voice frame.
3. The method according to claim 2, wherein the updating parameters of the neural network in the diffusion model based on the predicted speech frame and the target speech frame specifically comprises:
and taking the difference between the data distribution of the target voice frame and that of the predicted voice frame as the prediction loss function of the diffusion model, training the diffusion model with minimization of this loss as the objective, and updating the parameters of the neural network in the diffusion model.
4. The method of claim 1, further comprising, prior to inputting the stream of speech to be recognized into a pre-trained diffusion model:
segmenting the voice stream to be recognized according to a preset time window to obtain at least one voice stream segment;
and inputting the voice stream segment into the diffusion model to obtain a future voice frame corresponding to that segment.
5. The method of claim 1, wherein a ratio of the future speech frames to the received speech stream in the spliced speech stream is greater than a preset ratio threshold.
6. An end-to-end streaming voice recognition device comprising:
a receiving module configured to receive a voice stream to be recognized;
a future speech prediction module configured to input the voice stream to be recognized into a pre-trained diffusion model, and predict at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
a splicing module configured to splice the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
and the voice recognition module is configured to perform voice recognition on the spliced voice stream to obtain a voice recognition result.
7. The apparatus of claim 6, wherein the future speech prediction module is further configured to segment the voice stream to be recognized by a preset time window to obtain at least one voice stream segment, and to input each segment into the diffusion model to obtain a future voice frame corresponding to that segment.
8. The device of claim 6, wherein the splicing module is specifically configured to splice the future speech frame and the speech stream to be recognized according to a preset ratio to obtain the spliced speech stream; in the spliced voice stream, the ratio of the future voice frame to the received voice stream is larger than a preset ratio threshold.
9. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1 to 5.
10. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 5.
CN202310627778.4A, filed 2023-05-30 (priority 2023-05-30): Streaming voice recognition method and device and electronic equipment. Status: Pending. Publication: CN116741160A (en).

Priority Applications (1)

CN202310627778.4A, priority/filing date 2023-05-30: Streaming voice recognition method and device and electronic equipment

Publications (1)

CN116741160A, published 2023-09-12 (en)

Family

ID: 87914385

Family Applications (1)

CN202310627778.4A, filed 2023-05-30, status Pending: Streaming voice recognition method and device and electronic equipment

Country Status (1)

CN: CN116741160A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination