CN116741160A - Streaming voice recognition method and device and electronic equipment


Info

Publication number
CN116741160A
Authority
CN
China
Prior art keywords
voice
stream
speech
frame
diffusion model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310627778.4A
Other languages
Chinese (zh)
Inventor
李强
王志铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310627778.4A
Publication of CN116741160A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0635 - Training updating or merging of old and new templates; Mean values; Weighting
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiments of this specification disclose an end-to-end streaming voice recognition method, comprising the following steps: acquiring a voice stream to be recognized; inputting the voice stream to be recognized into a pre-trained diffusion model, and predicting at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model; splicing the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream; and performing voice recognition on the spliced voice stream to obtain a voice recognition result. Correspondingly, an end-to-end streaming voice recognition device is also disclosed.

Description

Streaming voice recognition method and device and electronic equipment
Technical Field
The present invention relates to the field of speech processing technologies, and in particular to a streaming voice recognition method and device, and an electronic device.
Background
In streaming voice recognition, a small amount of future speech information is typically added to enrich the contextual information available during recognition. This future information is usually taken from the real underlying audio, which makes the recognition system output words later than the user actually speaks them.
Disclosure of Invention
One purpose of the invention is to provide an end-to-end streaming voice recognition method that increases semantic context by predicting future voice frames in place of real future audio, thereby reducing the delay in the recognition process and improving the accuracy of real-time voice recognition.
According to the above object, an embodiment of the present disclosure provides an end-to-end streaming voice recognition method, including:
acquiring a voice stream to be recognized;
inputting the voice stream to be recognized into a pre-trained diffusion model, and predicting at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
splicing the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
and performing voice recognition on the spliced voice stream to obtain a voice recognition result.
The embodiments of this specification provide a new end-to-end streaming voice recognition method: future voice frames of the voice stream to be recognized are predicted by a pre-trained diffusion model, the predicted future voice frames are spliced with the voice stream to be recognized, and voice recognition is performed on the resulting spliced voice stream. Using predicted future voice frames in place of real ones adds semantic context while improving the accuracy and real-time performance of recognition, and mitigates the problem that the recognition results output by a voice recognition system are delayed.
Further, in some embodiments, the diffusion model is pre-trained as follows:
acquiring a target voice frame;
determining data distribution in the target voice frame, and taking the data distribution in the target voice frame as a training label;
successively and randomly adding Gaussian noise to the target voice frame for a preset number of times to obtain a sample voice stream;
inputting the sample voice stream into a preset diffusion model to obtain a predicted voice frame;
and updating parameters of the neural network in the diffusion model based on the predicted voice frame and the target voice frame.
Still further, in some embodiments, the updating parameters of the neural network in the diffusion model based on the predicted speech frame and the target speech frame specifically includes:
and taking the difference between the data distribution of the target voice frame and that of the predicted voice frame as the prediction loss function of the diffusion model, training the diffusion model with minimization of this loss as the objective, and updating the parameters of the neural network in the diffusion model.
Further, in some embodiments, before inputting the voice stream to be recognized into the pre-trained diffusion model, the method further includes:
segmenting the voice stream to be recognized according to a preset time window to obtain at least one voice stream segment;
and inputting the voice stream segment into the diffusion model to obtain a future voice frame corresponding to that segment.
Further, in some embodiments, a ratio of the future speech frames to the received speech stream in the spliced speech stream is greater than a preset ratio threshold.
Another purpose of the present invention is to provide an end-to-end streaming voice recognition device, which improves the accuracy of voice recognition by predicting future voice frames and mitigates the delay in the recognition results.
According to the above object, an embodiment of the present disclosure provides an end-to-end streaming voice recognition apparatus, including:
a receiving module configured to receive a voice stream to be recognized;
a future speech prediction module configured to input the voice stream to be recognized into a pre-trained diffusion model, and predict at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
a splicing module configured to splice the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
and the voice recognition module is configured to perform voice recognition on the spliced voice stream to obtain a voice recognition result.
Further, in some embodiments, the future speech prediction module is further configured to segment the voice stream to be recognized by a preset time window to obtain at least one voice stream segment, and to input each segment into the diffusion model to obtain the future voice frame corresponding to that segment.
Further, in some embodiments, the splicing module is specifically configured to splice the future voice frame and the voice stream to be recognized according to a preset ratio to obtain the spliced voice stream; in the spliced voice stream, the ratio of the future voice frame to the received voice stream is greater than a preset ratio threshold.
The embodiments of the present specification also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described end-to-end streaming voice recognition method.
The embodiment of the specification also provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform the end-to-end streaming voice recognition method described above.
The end-to-end streaming voice recognition method disclosed in the embodiments of this specification has the following beneficial effects: future voice frames of the voice stream to be recognized are predicted from Gaussian noise by the pre-trained diffusion model, and voice recognition is performed on the spliced voice stream obtained by joining the predicted future voice frames with the voice stream to be recognized. Replacing real future voice frames with predicted ones adds semantic context while improving the accuracy and real-time performance of recognition, and mitigates the delay in the recognition results output by the voice recognition system. In addition, the ratio of future voice frames to the received voice stream is controlled so that it is not too small, which safeguards the accuracy of voice recognition.
The end-to-end streaming voice recognition device described in the embodiments of this specification has the same beneficial effects.
Drawings
Fig. 1 schematically illustrates a flow diagram of an end-to-end streaming voice recognition method according to an embodiment of this specification in one implementation.
Fig. 2 schematically shows a flowchart of the diffusion model in the end-to-end streaming voice recognition method according to an embodiment of this specification, as executed in one implementation.
Fig. 3 schematically illustrates an application of the end-to-end streaming voice recognition method according to an embodiment of this specification in a specific implementation.
Fig. 4 schematically illustrates the structure of an end-to-end streaming voice recognition device according to an embodiment of this specification in one implementation.
Detailed Description
It is first noted that the terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, the embodiments and features in the embodiments in the present specification may be combined with each other without conflict.
The end-to-end streaming voice recognition method and device described in the embodiments of this specification are further described in detail below with reference to the accompanying drawings and specific embodiments, but the detailed description is not to be construed as limiting the embodiments of this specification.
Voice recognition is widely used in daily life: converting voice messages into text, voice input that can replace keyboard input, voice-controlled smart home appliances, and so on all make life more convenient. Conveying information by voice fits people's daily habits and is more natural and efficient. Compared with ordinary voice recognition, streaming voice recognition demands stronger real-time performance and is typically used in dialogue scenarios such as intelligent customer service and interactive risk control. Because voice must be converted into a text result in real time, the contextual voice information used during recognition can be supplemented by acquiring future voice frames, which safeguards recognition accuracy; but using real future voice frames inevitably makes the moment the recognition result is output later than the moment the user actually speaks, and the resulting delay fails the requirement of high real-time performance. Conversely, if streaming recognition is forced into a low-latency regime, the accuracy of the recognition result becomes unsatisfactory.
An end-to-end voice recognition system is one built from a neural network model; such systems are highly portable and reusable and can realize practical streaming voice recognition at relatively low cost. To achieve both low latency and high accuracy in streaming voice recognition, the embodiments of this specification implement end-to-end streaming voice recognition based on a diffusion model, which predicts future voice frames to replace real future voice frames when supplementing contextual voice information.
In one embodiment of this specification, an end-to-end streaming voice recognition method is presented. Fig. 1 schematically illustrates a flow diagram of the method in one implementation.
As shown in Fig. 1, the method includes:
100: Acquiring the voice stream to be recognized.
A voice stream to be recognized can be collected through a microphone; it should contain recognizable voice content, and its duration should not be too short, so as to ensure a good recognition effect.
102: inputting the voice stream to be recognized into a pre-trained diffusion model, and predicting at least one future voice frame of the voice stream to be recognized through a back diffusion process of the diffusion model.
The diffusion model takes its inspiration from non-equilibrium thermodynamics. By defining a Markov chain of diffusion steps, the forward diffusion process gradually adds random noise to the input original data a preset number of times and outputs noise data; the reverse diffusion process, which may also be called the reverse generation process, is then learned and trained so that data samples approximating the real original data can be constructed from the noise data.
In some embodiments, the diffusion model is pre-trained as follows:
acquiring a target voice frame;
determining data distribution in the target voice frame, and taking the data distribution in the target voice frame as a training label;
successively and randomly adding Gaussian noise to the target voice frame a preset number of times to obtain a sample voice stream;
Inputting the sample voice stream into a preset diffusion model to obtain a predicted voice frame;
based on the predicted speech frame and the target speech frame, parameters of the neural network in the diffusion model are updated.
During training of the diffusion model, the data distribution of the target voice frame serves as the original voice data input to the model. Forward diffusion randomly adds Gaussian noise to the target voice frame a preset number of times; the data distribution of the resulting sample voice stream is the diffused noise data, which simulates the voice stream that temporally precedes the target voice frame.
The sample voice stream is then input into the diffusion model for reverse generation, which predicts a future voice frame from the noise data of the sample voice stream, i.e., constructs a predicted voice frame approximating the target voice frame, treated as the simulated original voice data recovered from the noise data. Because the data distribution within a piece of audio is continuous, the original target voice frame cannot be obtained exactly by reverse generation; it can only be approximated from the sample voice stream.
Training the diffusion model focuses on the reverse generation process: the parameters of the neural network in the model are updated according to the difference between the predicted voice frame and the target voice frame. After several iterations of training, a trained reverse generation function, and hence a trained diffusion model, is obtained.
In some more specific embodiments, updating parameters of the neural network in the diffusion model based on the predicted speech frame and the target speech frame specifically includes:
taking the difference between the data distribution of the target voice frame and that of the predicted voice frame as the prediction loss function of the diffusion model, training the model with minimization of this loss as the objective, and updating the parameters of the neural network in the diffusion model.
In other words, the forward diffusion process simulates the received voice data preceding the target voice frame to obtain the sample voice stream, and the reverse generation process predicts the future voice data following the sample voice stream as an approximation of the target voice frame. The diffusion model is iteratively trained to minimize the data-distribution difference between the predicted voice frame and the target voice frame, updating the parameters of the model until that difference is smaller than a preset threshold, at which point the trained diffusion model is obtained; a minimal training sketch follows.
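The following is a minimal PyTorch sketch of this training loop. It is an illustration only: the patent specifies neither the network architecture, the noise schedule, nor the exact distance between data distributions, so the small MLP, the constant betas, and the MSE loss below are all assumptions.

```python
import torch
import torch.nn as nn

FRAME_DIM = 160                                   # assumed feature size per frame
model = nn.Sequential(                            # stand-in reverse-generation network
    nn.Linear(FRAME_DIM, 256), nn.ReLU(), nn.Linear(256, FRAME_DIM))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
betas = [0.01] * 100                              # assumed noise schedule, N = 100 steps

def train_step(target_frame: torch.Tensor) -> float:
    """One iteration: forward-diffuse the target frame into a sample stream,
    predict the frame back (reverse generation), and minimize the gap."""
    x = target_frame
    for beta in betas:                            # successively add Gaussian noise
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
    predicted_frame = model(x)                    # approximate the target frame
    loss = nn.functional.mse_loss(predicted_frame, target_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(FRAME_DIM))         # e.g., one step on a dummy frame
```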
In some embodiments, before inputting the voice stream to be recognized into the pre-trained diffusion model, the method further includes:
segmenting the voice stream to be recognized according to a preset time window to obtain at least one voice stream segment;
and inputting the voice stream segment into the diffusion model to obtain a future voice frame corresponding to that segment.
The duration of the voice stream to be recognized varies with the scenario, because the voice content expressing the user's demand varies: with voice input, a user may speak a fairly long passage; a command given to a smart home appliance is usually short; and the voice messages exchanged by the two parties in an intelligent customer-service dialogue may be long or short. For a voice stream shorter than the preset time window, the segmentation result is simply the complete voice stream to be recognized. For a longer stream, segmenting by the preset time window lets a corresponding future voice frame be generated for each segment (see the sketch below), which reduces the delay in producing voice recognition results to a much greater extent, and thus gives stronger real-time performance, than generating future voice frames once for the entire long stream.
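A sketch of the segmentation step, with assumed names and parameters (the text fixes no window length; the 640 ms default echoes the embodiment described later):

```python
def segment_stream(samples: list, sample_rate: int, window_ms: int = 640) -> list:
    """Cut a voice stream into segments of at most window_ms each; a stream
    shorter than one window comes back unchanged as a single segment."""
    window = sample_rate * window_ms // 1000
    if len(samples) <= window:
        return [samples]
    return [samples[i:i + window] for i in range(0, len(samples), window)]
```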
104: and splicing the acquired at least one future voice frame with the voice stream to be recognized to obtain a spliced voice stream.
Once the voice stream to be recognized has been acquired, it is treated as a successfully received voice stream awaiting further processing; it is then input into the diffusion model, whose reverse diffusion process predicts the corresponding future voice frames. The number of future voice frames must be greater than or equal to one so that they can be spliced with the received voice stream, after which the spliced voice stream is recognized.
In some embodiments, the ratio of the future speech frames to the received speech stream in the spliced speech stream is greater than a preset ratio threshold.
To ensure that the voice recognition system can recognize the spliced voice stream effectively and in real time, the future voice frames predicted by the diffusion model must carry useful voice information and actually enrich the semantic context of the voice information. The ratio of future voice frames to the received voice stream is therefore controlled so that the share of future frames is not too small, which improves the accuracy of voice recognition; a minimal check is sketched below.
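A minimal sketch of the splice with the ratio check, assuming list-like audio buffers and using the 1:8 floor from the embodiment described later as the preset threshold:

```python
def splice(received: list, future: list, min_ratio: float = 1 / 8) -> list:
    """Append predicted future frames to the received stream, rejecting
    splices whose future share is too small to add useful context."""
    if len(future) / len(received) < min_ratio:
        raise ValueError("future-frame ratio below the preset threshold")
    return received + future
```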
106: and performing voice recognition on the spliced voice stream to obtain a voice recognition result.
In some embodiments, the voice recognition result may be obtained from the spliced voice stream using an existing speech-to-text algorithm.
Fig. 2 schematically shows a flowchart of the diffusion model in the end-to-end streaming voice recognition method, as executed in one implementation.
As shown in Fig. 2, in some embodiments, the input original voice data in the forward diffusion process of the diffusion model is denoted x_0 and follows the distribution q; that is, Gaussian noise can be added to voice data through the q() function. For example, applying q(x_1 | x_0) to the original voice data x_0 adds Gaussian noise once and yields the once-noised voice data x_1. With the number of noising steps set to N, the q() operation is performed N times, finally applying q(x_N | x_{N-1}) to obtain the noised voice data x_N. This noised voice data can serve as the sample voice stream, simulating the voice stream that temporally precedes the original voice data.
The q () function can be obtained by:
wherein x represents voice data; n represents the number of times noise is added to the voice data; beta n Is gaussian noise conforming to normal distribution;is a noise function; i represents the range of the noise function.
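Read as code, one q() step looks like the following (a PyTorch sketch; the function name and the constant schedule are assumptions, since the patent shows no code):

```python
import math
import torch

def q_step(x_prev: torch.Tensor, beta_n: float) -> torch.Tensor:
    """One forward-diffusion step: sample
    x_n ~ N(x_n; sqrt(1 - beta_n) * x_{n-1}, beta_n * I)."""
    return math.sqrt(1.0 - beta_n) * x_prev + math.sqrt(beta_n) * torch.randn_like(x_prev)

# Applying the step N times turns original voice data x_0 into the noised x_N:
x = torch.randn(160)                  # stand-in for one frame of voice features
for beta in [0.01] * 100:             # assumed constant schedule, N = 100
    x = q_step(x, beta)
```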
In the reverse generation process of the diffusion model, the noised voice data x_N produced by forward diffusion is taken as the model input, from which the original voice data x_0 is gradually recovered, i.e., predicted. The distribution that the noised voice data x_N conforms to is denoted p; the original voice data can then be restored step by step with the p() function. For example, p(x_{N-1} | x_N) yields the intermediate data x_{N-1} from the noised voice data x_N, and after N such steps, p(x_0 | x_1) finally yields the recovered voice data x_0'.
The p distribution needs to be predicted by a deep learning model, which is set as:

$$p_\theta(x_{n-1} \mid x_n) = \mathcal{N}\!\left(x_{n-1};\ \mu_\theta(x_n, n),\ \Sigma_\theta(x_n, n)\right)$$

wherein μ_θ(x_n, n) and Σ_θ(x_n, n) are the model quantities to be trained, which can be represented uniformly as Z_θ; θ denotes the model parameters.
During training of the reverse generation process of the diffusion model, the prediction loss function is determined by the distribution difference between the original voice data x_0 and the voice data x_0' recovered from the noised voice data. The diffusion model is trained for several iterations with the aim of minimizing this prediction loss, updating the parameters Z_θ in the model until the distribution difference between the original voice data x_0 and the recovered voice data x_0' is smaller than a preset threshold, at which point the trained diffusion model is obtained.
When voice data with added random noise needs a prediction, it is input into the trained diffusion model for reverse generation, and the corresponding future voice frame is obtained by stepping through the p() function, as sketched below.
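The stepwise p() computation can be read as the following loop (a sketch; mu_theta and sigma_theta stand for the trained model quantities, and their call signature is an assumption):

```python
import torch

@torch.no_grad()
def reverse_generate(x_N: torch.Tensor, mu_theta, sigma_theta, N: int) -> torch.Tensor:
    """Iterate p(x_{n-1} | x_n) = N(mu_theta(x_n, n), sigma_theta(x_n, n))
    for n = N..1, returning the recovered voice data x_0'."""
    x = x_N
    for n in range(N, 0, -1):
        # no fresh noise at the final step (common diffusion-sampling convention)
        noise = torch.randn_like(x) if n > 1 else torch.zeros_like(x)
        x = mu_theta(x, n) + sigma_theta(x, n) * noise
    return x
```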
Fig. 3 schematically illustrates an application of the end-to-end streaming voice recognition method in a specific implementation.
As shown in Fig. 3, in a preferred embodiment, a voice stream of 640 ms duration is first acquired as the voice stream to be recognized and is received by the voice recognition system to await further processing. The voice stream to be recognized is input into the pre-trained diffusion model, and at least one future voice frame corresponding to the stream is predicted through the reverse diffusion process of the model. To ensure the accuracy of voice recognition, this embodiment sets the optimal ratio of future voice frames to the received voice stream to 1:4, so the predicted future voice frames total 160 ms; preferably, the ratio may be set to greater than or equal to 1:8. The future voice frames are spliced after the received voice stream to be recognized, giving a spliced voice stream of 800 ms duration, and voice recognition is then performed on the spliced stream to obtain the recognition result.
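The arithmetic of this embodiment, as a self-contained sketch (the zero arrays stand in for real audio and model output, and the 16 kHz sampling rate is an assumption; the text gives none):

```python
import numpy as np

SAMPLE_RATE = 16000                               # assumed; not specified in the text

def ms_to_samples(ms: int) -> int:
    return SAMPLE_RATE * ms // 1000

received = np.zeros(ms_to_samples(640))           # 640 ms received voice stream
future = np.zeros(ms_to_samples(160))             # 160 ms predicted future frames
assert len(future) / len(received) == 1 / 4       # optimal ratio 1:4, above the 1:8 floor
spliced = np.concatenate([received, future])      # 800 ms spliced voice stream
# `spliced` would then go to an existing speech-to-text backend for recognition.
```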
According to the end-to-end streaming voice recognition method provided by the embodiments of this specification, future voice frames of the voice stream to be recognized are predicted from Gaussian noise by the pre-trained diffusion model, and voice recognition is performed on the spliced voice stream obtained by joining the predicted future voice frames with the voice stream to be recognized. Replacing real future voice frames with predicted ones adds semantic context while improving the accuracy and real-time performance of voice recognition, and mitigates the delay in the recognition results output by the voice recognition system. In addition, the ratio of future voice frames to the received voice stream is controlled so that it is not too small, which ensures the accuracy of voice recognition.
In another embodiment of the present specification, an end-to-end streaming voice recognition apparatus is presented.
Fig. 4 schematically illustrates the structure of an end-to-end streaming voice recognition device according to an embodiment of this specification in one implementation.
As shown in Fig. 4, the device includes:
a receiving module 20 configured to receive a voice stream to be recognized;
a future speech prediction module 22 configured to input the voice stream to be recognized into a pre-trained diffusion model, and predict at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
a splicing module 24 configured to splice the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
the voice recognition module 26 is configured to perform voice recognition on the spliced voice stream to obtain a voice recognition result.
The receiving module can receive a voice stream to be recognized collected by a microphone; the stream should contain recognizable voice content, and its duration should not be too short, so as to ensure a good recognition effect.
By defining a Markov chain of diffusion steps, the diffusion model in the future speech prediction module gradually adds random noise to the input original data a preset number of times in the forward diffusion process and outputs noise data; the reverse diffusion process, which may also be called the reverse generation process, is then learned and trained so that data samples approximating the real original data can be constructed from the noise data.
In some embodiments, the future speech prediction module further includes a training module, where the training module is configured to obtain a target speech frame; determining data distribution in the target voice frame, and taking the data distribution in the target voice frame as a training label; successively and randomly adding Gaussian noise to a target voice frame for a preset number of times to obtain a sample voice stream; inputting the sample voice stream into a preset diffusion model to obtain a predicted voice frame; based on the predicted speech frame and the target speech frame, parameters of the neural network in the diffusion model are updated.
The training module trains the diffusion model: the data distribution of the target voice frame serves as the original voice data input to the model, Gaussian noise is randomly added to the target voice frame a preset number of times through forward diffusion, and the data distribution of the resulting sample voice stream is the diffused noise data, which simulates the voice stream that temporally precedes the target voice frame.
The sample voice stream is then input into the diffusion model for reverse generation, which predicts a future voice frame from the noise data of the sample voice stream, i.e., constructs a predicted voice frame approximating the target voice frame, treated as the simulated original voice data recovered from the noise data. Because the data distribution within a piece of audio is continuous, the original target voice frame cannot be obtained exactly by reverse generation; it can only be approximated from the sample voice stream.
The key point of the training module is to train the reverse generation process of the diffusion model, updating the parameters of the neural network according to the difference between the predicted voice frame and the target voice frame; the parameters to be updated include those of the reverse generation function that predicts future voice frames from the noise data of the sample voice stream. After several iterations of training, a trained reverse generation function, and hence a trained diffusion model, is obtained.
In some more specific embodiments, the training module is configured to take the difference between the data distribution of the target voice frame and that of the predicted voice frame as the prediction loss function of the diffusion model, to train the diffusion model with minimization of this loss as the objective, and to update the parameters of the neural network in the diffusion model.
In other words, the forward diffusion process simulates the received voice data preceding the target voice frame to obtain the sample voice stream, and the reverse generation process predicts the future voice data following the sample voice stream as an approximation of the target voice frame. The training module iteratively trains the diffusion model to minimize the data-distribution difference between the predicted voice frame and the target voice frame, updating the parameters of the model until that difference is smaller than a preset threshold, yielding the trained diffusion model.
In some embodiments, the future speech prediction module is further configured to segment the voice stream to be recognized by a preset time window to obtain at least one voice stream segment, and to input each segment into the diffusion model to obtain the future voice frame corresponding to that segment.
The duration of the voice stream to be recognized varies with the scenario, because the voice content expressing the user's demand varies: with voice input, a user may speak a fairly long passage; a command given to a smart home appliance is usually short; and the voice messages exchanged by the two parties in an intelligent customer-service dialogue may be long or short. For a voice stream shorter than the preset time window, the future speech prediction module's segmentation result is simply the complete voice stream to be recognized; for a longer stream, segmenting by the preset time window lets the module generate a corresponding future voice frame for each segment, which reduces the delay in producing recognition results to a much greater extent, and thus gives stronger real-time performance, than generating a single future voice frame for the entire long stream.
Once the voice stream to be recognized has been acquired, it is treated as a successfully received voice stream awaiting further processing; it is then input into the diffusion model, whose reverse diffusion process predicts the corresponding future voice frames. The number of future voice frames must be greater than or equal to one so that the splicing module can join them to the received voice stream, after which the spliced voice stream is recognized.
In some embodiments, the splicing module is specifically configured to splice the future voice frame and the voice stream to be recognized according to a preset ratio to obtain the spliced voice stream; in the spliced voice stream, the ratio of the future voice frame to the received voice stream is greater than a preset ratio threshold.
To ensure that the voice recognition system can recognize the spliced voice stream effectively and in real time, the future voice frames predicted by the diffusion model must carry useful voice information and actually enrich the semantic context of the voice information; the splicing module therefore controls the proportion between the future voice frames and the received voice stream so that the share of future frames is not too small, which improves the accuracy of voice recognition.
In some embodiments, the voice recognition module may obtain the voice recognition result from the spliced voice stream using an existing speech-to-text algorithm.
One embodiment of the present specification also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor implements the end-to-end streaming voice recognition method described above.
One embodiment in the present specification also provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform the end-to-end streaming voice recognition method described above.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
It should be noted that the above-mentioned embodiments are merely examples of the present invention, and it is obvious that the present invention is not limited to the above-mentioned embodiments, and many similar variations are possible. All modifications attainable or obvious from the present disclosure set forth herein should be deemed to be within the scope of the present disclosure.

Claims (10)

1. An end-to-end streaming voice recognition method comprising:
acquiring a voice stream to be recognized;
inputting the voice stream to be recognized into a pre-trained diffusion model, and predicting at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
splicing the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
and performing voice recognition on the spliced voice stream to obtain a voice recognition result.
2. The method of claim 1, wherein the diffusion model is pre-trained by:
acquiring a target voice frame;
determining data distribution in the target voice frame, and taking the data distribution in the target voice frame as a training label;
successively and randomly adding Gaussian noise to the target voice frame for a preset number of times to obtain a sample voice stream;
inputting the sample voice stream into a preset diffusion model to obtain a predicted voice frame;
and updating parameters of the neural network in the diffusion model based on the predicted voice frame and the target voice frame.
3. The method according to claim 2, wherein the updating parameters of the neural network in the diffusion model based on the predicted speech frame and the target speech frame specifically comprises:
and taking the difference between the data distribution of the target voice frame and that of the predicted voice frame as the prediction loss function of the diffusion model, training the diffusion model with minimization of this loss as the objective, and updating the parameters of the neural network in the diffusion model.
4. The method of claim 1, further comprising, prior to inputting the stream of speech to be recognized into a pre-trained diffusion model:
segmenting the voice stream to be recognized according to a preset time window to obtain at least one voice stream segment;
and inputting the voice stream segment into the diffusion model to obtain a future voice frame corresponding to that segment.
5. The method of claim 1, wherein a ratio of the future speech frames to the received speech stream in the spliced speech stream is greater than a preset ratio threshold.
6. An end-to-end streaming voice recognition device comprising:
a receiving module configured to receive a voice stream to be recognized;
a future speech prediction module configured to input the voice stream to be recognized into a pre-trained diffusion model, and predict at least one future voice frame of the voice stream through the reverse diffusion process of the diffusion model;
a splicing module configured to splice the at least one predicted future voice frame with the voice stream to be recognized to obtain a spliced voice stream;
and the voice recognition module is configured to perform voice recognition on the spliced voice stream to obtain a voice recognition result.
7. The apparatus of claim 6, wherein the future speech prediction module is further configured to segment the voice stream to be recognized by a preset time window to obtain at least one voice stream segment, and to input each segment into the diffusion model to obtain a future voice frame corresponding to that segment.
8. The device of claim 6, wherein the splicing module is specifically configured to splice the future speech frame and the speech stream to be recognized according to a preset ratio to obtain the spliced speech stream; in the spliced voice stream, the ratio of the future voice frame to the received voice stream is larger than a preset ratio threshold.
9. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1 to 5.
10. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 5.
CN202310627778.4A, filed 2023-05-30 (priority 2023-05-30): Streaming voice recognition method and device and electronic equipment. Status: Pending. Publication: CN116741160A (en).

Priority Applications (1)

CN202310627778.4A, priority/filing date 2023-05-30: Streaming voice recognition method and device and electronic equipment

Publications (1)

CN116741160A, published 2023-09-12 (en)

Family

ID: 87914385

Family Applications (1)

CN202310627778.4A, filed 2023-05-30, status Pending: Streaming voice recognition method and device and electronic equipment

Country Status (1)

CN: CN116741160A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination