CN117079653A - Speech recognition method, training method, device and medium for speech recognition model - Google Patents

Speech recognition method, training method, device and medium for speech recognition model

Info

Publication number
CN117079653A
Authority
CN
China
Prior art keywords
audio
stream
streaming
encoder
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311312710.3A
Other languages
Chinese (zh)
Inventor
谢继亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202311312710.3A
Publication of CN117079653A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a speech recognition method, a training method for a speech recognition model, a device, and a medium, and relates to the technical field of speech processing. It provides a speech recognition model containing both a streaming encoder and a non-streaming encoder, so that the model can be adapted to electronic devices with different computing power deployment capabilities. The method is applied to an electronic device associated with a speech recognition model that includes a streaming encoder and a non-streaming encoder. If the electronic device is of a first device type, the method comprises: receiving an audio stream input by a user; and performing speech recognition on the audio stream based on the streaming encoder to obtain a speech recognition result of the audio stream. If the electronic device is of a second device type, the method comprises: receiving an audio stream input by a user; and performing speech recognition on the audio stream based on the streaming encoder and the non-streaming encoder to obtain a speech recognition result of the audio stream. The computing power deployment capability of the first device type is lower than that of the second device type.

Description

Speech recognition method, training method, device and medium for speech recognition model
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech recognition method, a training method for a speech recognition model, a device, and a medium.
Background
Speech recognition (automatic speech recognition, ASR) is a technique for converting speech into corresponding text. In recent years, speech recognition technology has been applied to various scenarios on electronic devices, such as intelligent (artificial intelligence, AI) conversation, voice assistants, and conference recording.
Currently, an electronic device is typically deployed with a speech recognition system that provides a speech recognition model with a single, unified architecture. Accordingly, when the speech recognition function of the electronic device is triggered, speech recognition is performed using this unified-architecture model.
However, a speech recognition model with a unified architecture cannot be well adapted to electronic devices with different computing power deployment capabilities.
Disclosure of Invention
The application provides a speech recognition method, a training method for a speech recognition model, a device, and a medium, which provide a speech recognition model having both a streaming encoder and a non-streaming encoder that can be adapted to electronic devices with different computing power deployment capabilities.
In order to achieve the above purpose, the application adopts the following technical scheme:
in a first aspect, a speech recognition method is provided, applied to an electronic device, the electronic device being associated with a speech recognition model, the speech recognition model comprising a streaming encoder and a non-streaming encoder. That is, a speech recognition model is provided that has both a streaming encoder and a non-streaming encoder.
The electronic device belongs to a first device type; the method comprises: receiving an audio stream input by a user, and performing speech recognition on the audio stream based on the streaming encoder to obtain a speech recognition result of the audio stream. That is, for an electronic device of the first device type, speech recognition may be performed on the audio stream received by the electronic device based on the streaming encoder.
Alternatively, the electronic device belongs to a second device type; the method comprises: receiving an audio stream input by a user, and performing speech recognition on the audio stream based on the streaming encoder and the non-streaming encoder to obtain a speech recognition result of the audio stream. That is, for an electronic device of the second device type, the audio stream received by the electronic device may be recognized using the streaming encoder in combination with the non-streaming encoder.
Wherein the computing power deployment capability of the first device type is lower than that of the second device type. That is, for an electronic device with low computing power deployment capability, speech recognition may be performed on the received audio stream based on the streaming encoder alone; for an electronic device with high computing power deployment capability, the received audio stream may be recognized using the streaming encoder in combination with the non-streaming encoder.
Thus, because the streaming encoder performs speech recognition frame by frame, it operates at the granularity of individual audio frames and therefore consumes fewer computing resources. When the method is applied to an electronic device with low computing power deployment capability, performing speech recognition based on the streaming encoder effectively reduces the computing resources consumed by speech recognition, reduces the latency of speech recognition, and improves the efficiency of speech recognition. When the method is applied to an electronic device with high computing power deployment capability, the device's higher computing power allows the streaming encoder to be combined with the non-streaming encoder for speech recognition, which ensures the efficiency of speech recognition while also improving its accuracy.
In a possible implementation manner of the first aspect, performing speech recognition on the audio stream based on the stream encoder includes:
the audio stream is input into the streaming encoder frame by frame, and the streaming encoder encodes each input audio frame to obtain the streaming coding feature of the audio frame. That is, the audio stream is fed into the streaming encoder one audio frame at a time, and each audio frame is encoded to obtain its streaming coding feature. The streaming coding feature is used to characterize the audio feature of the audio frame.
A first output probability distribution of the audio frame is then determined based on the streaming coding feature of the audio frame and the text prediction result of the audio frame. The text prediction result is predicted based on the speech recognition result of the audio frame preceding the current audio frame. That is, the first output probability distribution of the audio frame is determined by combining the text prediction result of the audio frame with its streaming coding feature, which can improve the accuracy of speech recognition.
The first output probability distribution represents the probability distribution of the audio frame over a plurality of candidate texts corresponding to the text prediction result. That is, performing speech recognition on the audio frame yields probability values of the audio frame over a plurality of candidate texts. It should be appreciated that if the probability value of the audio frame for a candidate text is high, the audio frame is likely to correspond to that candidate text.
In this possible implementation, the streaming encoder encodes the input audio frame to obtain its streaming coding feature, and the first output probability distribution of the audio frame is then determined in combination with the text prediction result of the audio frame, which ensures the efficiency of speech recognition while also improving its accuracy.
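As a purely illustrative sketch of the frame-by-frame processing described above, the following Python code shows one possible way a causal streaming encoder could consume an audio stream one frame at a time while carrying state across frames. The use of PyTorch, the LSTM layer, and all dimensions and names are assumptions introduced here for illustration; they are not details fixed by the application.

```python
import torch
import torch.nn as nn

class StreamingEncoderSketch(nn.Module):
    """Minimal causal encoder: processes one audio frame at a time and carries
    recurrent state forward, so each output depends only on past frames."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def encode_frame(self, frame, state=None):
        # frame: (batch, 1, feat_dim), the features of a single audio frame
        out, state = self.rnn(frame, state)
        return out, state  # out: (batch, 1, hidden_dim), the streaming coding feature

# Usage: feed an audio stream into the encoder frame by frame.
encoder = StreamingEncoderSketch()
audio_stream = torch.randn(1, 10, 80)   # stand-in features for 10 audio frames
state, stream_features = None, []
for t in range(audio_stream.size(1)):
    feat, state = encoder.encode_frame(audio_stream[:, t:t + 1, :], state)
    stream_features.append(feat)        # per-frame streaming coding features
```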
In another possible implementation manner of the first aspect, the speech recognition model further includes a fusion network, and the fusion network may be provided with a feature fusion function.
Accordingly, determining a first output probability distribution for the audio frame based on the stream coding feature of the audio frame and the text prediction result of the audio frame, comprising:
inputting the stream coding feature of the audio frame and the text prediction result of the audio frame into the fusion network, and carrying out fusion processing on the stream coding feature of the audio frame and the text prediction result of the audio frame through the fusion network to obtain a first output probability distribution of the audio frame.
In this possible implementation, the fusion network of the speech recognition model fuses the streaming coding feature of the audio frame with the text prediction result of the audio frame, so that both are taken into account, which ensures the efficiency of speech recognition while also improving its accuracy.
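The following sketch illustrates one possible form such a fusion network could take, in the style of a Transducer joint network that combines an encoder feature with a prediction-network feature. The additive fusion, the dimensions, and the vocabulary size are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FusionNetworkSketch(nn.Module):
    """Fuses the streaming coding feature of an audio frame with the text
    prediction feature and outputs a distribution over candidate texts."""
    def __init__(self, enc_dim=256, pred_dim=256, joint_dim=320, vocab_size=5000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, enc_feat, pred_feat):
        fused = torch.tanh(self.enc_proj(enc_feat) + self.pred_proj(pred_feat))
        return self.out(fused).log_softmax(dim=-1)  # output probability distribution

fusion = FusionNetworkSketch()
enc_feat = torch.randn(1, 256)    # streaming coding feature of one audio frame
pred_feat = torch.randn(1, 256)   # text prediction result of the frame (as a feature vector)
first_output_probs = fusion(enc_feat, pred_feat)   # shape (1, vocab_size)
```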
In another possible implementation manner of the first aspect, after determining the first output probability distribution of the audio frame based on the stream coding feature of the audio frame and the text prediction result of the audio frame, the method further includes:
and decoding the first output probability distribution of the audio frame to obtain a voice recognition result of the audio frame. And, the voice recognition result of the audio frame is displayed on the electronic device.
In this possible implementation, each audio frame is encoded by the streaming encoder, fused with the text prediction result by the fusion network, and then decoded, yielding a speech recognition result for the audio frame and realizing frame-level speech recognition. Further, the speech recognition result of the audio frame is displayed on the electronic device. In this way, recognized results can be displayed in real time while the user is still inputting the audio stream, realizing real-time display of frame-level recognition results and improving human-computer interaction efficiency.
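The decoding of the first output probability distribution can be illustrated with a simple greedy rule: take the most likely candidate for the frame and extend the running hypothesis only when the candidate is not the blank symbol. The blank convention and the function below are illustrative assumptions, not details specified by the application.

```python
import torch

def decode_frame(log_probs, hypothesis, blank_id=0):
    """Greedy decoding of one frame's output probability distribution: take the
    most likely candidate and extend the hypothesis only if it is not blank."""
    token_id = int(log_probs.argmax(dim=-1))
    if token_id != blank_id:
        hypothesis.append(token_id)   # in practice, mapped to its candidate text
    return hypothesis

# Usage: decode a stand-in distribution and show the running (partial) result.
log_probs = torch.randn(5000).log_softmax(dim=-1)   # stand-in output probability distribution
hypothesis = decode_frame(log_probs, hypothesis=[])
print("partial speech recognition result:", hypothesis)  # displayed to the user in real time
```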
In another possible implementation manner of the first aspect, performing speech recognition on the audio stream based on the streaming encoder and the non-streaming encoder includes:
the audio stream is input into the streaming encoder frame by frame, and the streaming encoder encodes each input audio frame to obtain the streaming coding feature of the audio frame. That is, the audio stream is fed into the streaming encoder one audio frame at a time, and each audio frame is encoded to obtain its streaming coding feature. The streaming coding feature is used to characterize the audio feature of the audio frame.
And inputting the streaming coding features of the plurality of audio frames in the audio stream into the non-streaming encoder, and encoding the streaming coding features of the plurality of audio frames by the non-streaming encoder to obtain the non-streaming coding feature of the audio stream. That is, the streaming coding features of all audio frames in the audio stream are input into the non-streaming encoder and encoded to obtain the non-streaming coding feature of the audio stream. The non-streaming coding feature is used to characterize the audio feature of the audio stream.
Further, a second output probability distribution of the audio stream is determined based on the non-streaming coding feature of the audio stream and the text prediction results of the plurality of audio frames in the audio stream. Each text prediction result is predicted based on the speech recognition result of the audio frame preceding the corresponding audio frame. That is, the second output probability distribution of the audio stream is determined by combining the text prediction results of all audio frames in the audio stream with the non-streaming coding feature of the audio stream, which can improve the accuracy of speech recognition.
The second output probability distribution represents the probability distributions of the plurality of audio frames in the audio stream over the plurality of candidate texts corresponding to their respective text prediction results. That is, performing speech recognition on the audio stream yields probability values of the candidate texts for each audio frame in the audio stream. It should be appreciated that if the probability value of an audio frame for a candidate text is high, the audio frame is likely to correspond to that candidate text.
In this possible implementation, the streaming encoder encodes the input audio frames to obtain the streaming coding features of all audio frames in the audio stream, the non-streaming encoder then encodes these streaming coding features to obtain the non-streaming coding feature of the audio stream, and the second output probability distribution of the audio stream is determined in combination with the text prediction results of all audio frames in the audio stream, which ensures the efficiency of speech recognition while also improving its accuracy.
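A minimal sketch of this two-pass idea is shown below: the streaming coding features of all audio frames are re-encoded by a non-streaming encoder that attends over the whole sequence. Using a Transformer encoder as the non-streaming encoder, as well as all dimensions and names, are assumptions made here for illustration only.

```python
import torch
import torch.nn as nn

class NonStreamingEncoderSketch(nn.Module):
    """Re-encodes the streaming coding features of the whole audio stream at once,
    so every output position can use the full left and right context."""
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, stream_features):
        # stream_features: (batch, num_frames, dim), output of the streaming encoder
        return self.encoder(stream_features)   # non-streaming coding features of the audio stream

non_streaming_encoder = NonStreamingEncoderSketch()
stream_features = torch.randn(1, 100, 256)     # streaming coding features of all audio frames
non_stream_features = non_streaming_encoder(stream_features)   # same shape, full-context features
```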
In another possible implementation manner of the first aspect, the speech recognition model further includes a fusion network, and the fusion network may be provided with a feature fusion function.
Accordingly, determining a second output probability distribution for the audio stream based on the non-streaming coding feature of the audio stream and text predictions for a plurality of audio frames in the audio stream, comprising:
inputting the non-stream coding characteristics of the audio stream and the text prediction results of a plurality of audio frames in the audio stream into the fusion network, and carrying out fusion processing on the non-stream coding characteristics of the audio stream and the text prediction results of a plurality of audio frames in the audio stream through the fusion network to obtain a second output probability distribution of the audio stream.
In this possible implementation, the fusion network of the speech recognition model fuses the non-streaming coding feature of the audio stream with the text prediction results of the plurality of audio frames in the audio stream, so that both are taken into account, which ensures the efficiency of speech recognition while also improving its accuracy.
In another possible implementation manner of the first aspect, after determining the second output probability distribution of the audio stream based on the non-streaming coding feature of the audio stream and the text prediction results of the plurality of audio frames in the audio stream, the method further includes:
and decoding the second output probability distribution of the audio stream to obtain a voice recognition result of the audio stream. And, the voice recognition result of the audio stream is displayed on the electronic device.
In this possible implementation, the streaming encoder and the non-streaming encoder are combined for speech recognition: encoding by the streaming encoder, re-encoding by the non-streaming encoder, feature fusion by the fusion network, and then decoding together realize speech recognition at the level of the whole audio stream. Further, the speech recognition result of the audio stream is displayed on the electronic device. In this way, a speech recognition result with higher accuracy can be displayed to the user after the user inputs the audio stream, improving human-computer interaction efficiency.
In another possible implementation manner of the first aspect, the speech recognition model further includes a downsampling module for downsampling the output characteristics of at least one network layer of the streaming encoder or the non-streaming encoder. Wherein the at least one network layer may be an input layer or an intermediate layer.
Therefore, the length of the output features inside the streaming encoder or the non-streaming encoder can be effectively shortened, which reduces the amount of computation of the encoder and hence of the speech recognition model, so that the efficiency of speech recognition can be further improved.
In another possible implementation manner of the first aspect, the downsampling module includes a first convolution module and a second convolution module. Wherein the first convolution module is configured to compress a feature sequence length of the output feature. The second convolution module is used for fusing information among different channels.
Therefore, compressing the feature sequence length of the output features through the first convolution module effectively shortens the length of the output features and reduces the amount of computation of the speech recognition model, so that the efficiency of speech recognition can be further improved. The second convolution module fuses information among different channels, which increases the amount of information referenced by speech recognition and improves the accuracy of speech recognition.
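For illustration, the two convolution modules could be realized as a strided convolution along the time axis that compresses the feature sequence length, followed by a pointwise convolution that mixes information across channels. The kernel size, stride, and the use of one-dimensional convolutions below are assumptions; the application does not fix these details here.

```python
import torch
import torch.nn as nn

class DownsamplingModuleSketch(nn.Module):
    """First convolution compresses the feature sequence length; the second
    fuses information among different channels."""
    def __init__(self, channels=256, stride=2):
        super().__init__()
        # First convolution module: strided along time, halves the sequence length.
        self.compress = nn.Conv1d(channels, channels, kernel_size=3,
                                  stride=stride, padding=1, groups=channels)
        # Second convolution module: pointwise, fuses information between channels.
        self.mix = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, num_frames, channels), output features of a network layer
        x = x.transpose(1, 2)                 # (batch, channels, num_frames)
        x = self.mix(self.compress(x))
        return x.transpose(1, 2)              # (batch, num_frames // stride, channels)

down = DownsamplingModuleSketch()
feats = torch.randn(1, 100, 256)
short_feats = down(feats)                     # sequence length reduced from 100 to 50
```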
In a second aspect, the present application provides a method for training a speech recognition model, the method comprising:
performing iterative training on an initial model based on audio training data to obtain the speech recognition model; the audio training data includes an audio sample and a labeling text of the audio sample. That is, the initial model is iteratively trained based on the audio sample and the labeling text of the audio sample to obtain a speech recognition model with good speech recognition performance.
In any iterative training process, the audio training data is input into the model obtained after the previous iterative training, a first recognition result of the audio sample is obtained by the streaming encoder of the model, and a first model loss value of the streaming encoder is determined based on the first recognition result and the labeling text. That is, the first model loss value of the streaming encoder is determined using the labeling text and the first recognition result obtained by performing speech recognition with the streaming encoder of the model. The first model loss value is capable of representing the difference between the speech recognition result output based on the streaming encoder and the labeling text.
A second recognition result of the audio sample is obtained by the streaming encoder and the non-streaming encoder of the model, and a second model loss value of the non-streaming encoder is determined based on the second recognition result and the labeling text. That is, the second model loss value of the non-streaming encoder is determined using the labeling text and the second recognition result obtained by performing speech recognition with the streaming encoder and the non-streaming encoder of the model. The second model loss value is capable of representing the difference between the speech recognition result output based on the non-streaming encoder and the labeling text.
Further, model parameters are adjusted based on the first model loss value and the second model loss value. In this way, the first model loss value can represent the difference between the speech recognition result output by the streaming encoder and the labeling text, and the second model loss value can represent the difference between the speech recognition result output by the non-streaming encoder and the labeling text, so that the model parameters are adjusted according to the first model loss value and the second model loss value, the learning capacity of the model can be improved, and the speech recognition model with better learning capacity can be trained.
The speech recognition model includes a streaming encoder and a non-streaming encoder. The streaming encoder is configured to provide speech recognition services for electronic devices of a first device type, and the streaming encoder in combination with the non-streaming encoder is configured to provide speech recognition services for electronic devices of a second device type, where the computing power deployment capability of the first device type is lower than that of the second device type. That is, for an electronic device with low computing power deployment capability, speech recognition may be performed on the received audio stream based on the streaming encoder alone; for an electronic device with high computing power deployment capability, the received audio stream may be recognized using the streaming encoder in combination with the non-streaming encoder.
Thus, because the streaming encoder performs speech recognition frame by frame, it operates at the granularity of individual audio frames and therefore consumes fewer computing resources. When the method is applied to an electronic device with low computing power deployment capability, performing speech recognition based on the streaming encoder effectively reduces the computing resources consumed by speech recognition, reduces the latency of speech recognition, and improves the efficiency of speech recognition. When the method is applied to an electronic device with high computing power deployment capability, the device's higher computing power allows the streaming encoder to be combined with the non-streaming encoder for speech recognition, which ensures the efficiency of speech recognition while also improving its accuracy.
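The overall iterative procedure can be summarized by the sketch below. The method names on the model (the two recognition passes and the two loss routines), the data loader, and the number of epochs are placeholders introduced for illustration; they are not defined by the application.

```python
def train_speech_recognition_model(model, optimizer, data_loader, num_epochs=10):
    """Iterative training sketch: each step obtains the two recognition results,
    computes the two model loss values against the labeling text, and adjusts
    the model parameters (here with a plain sum; a weighted sum may be used)."""
    for _ in range(num_epochs):
        for audio_sample, labeling_text in data_loader:
            first_result = model.recognize_streaming(audio_sample)          # streaming encoder only
            first_loss = model.streaming_loss(first_result, labeling_text)
            second_result = model.recognize_two_pass(audio_sample)          # streaming + non-streaming
            second_loss = model.non_streaming_loss(second_result, labeling_text)
            loss = first_loss + second_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```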
In a possible implementation manner of the second aspect, the obtaining, by the streaming encoder of the model, the first recognition result of the audio sample includes:
and encoding the audio sample by the stream encoder to obtain the stream encoding characteristic of the audio sample. That is, in the model training stage, the audio samples may be directly input into the streaming encoder, and then the streaming encoder encodes each audio frame in the audio samples frame by frame to obtain the streaming encoding characteristics of the audio samples. Wherein the stream coding feature is used to characterize the audio characteristics of each audio frame in the audio sample.
And determining a first recognition result of the audio sample based on the streaming coding feature of the audio sample and the text prediction result of the audio sample. The text prediction result is predicted based on the speech recognition result of the audio frame preceding each audio frame in the audio sample. That is, the first recognition result of the audio sample is determined by combining the text prediction result of the audio sample with the streaming coding feature of the audio sample, so that a model that performs speech recognition based on the streaming coding feature combined with the text prediction result can be trained.
In another possible implementation manner of the second aspect, determining the first model loss value of the stream encoder based on the first recognition result and the labeling text includes:
and determining the RNNT loss value of the streaming encoder based on the first recognition result and the labeling text as a first model loss value of the streaming encoder. In this way, by determining the RNNT loss value of the streaming encoder, the first model loss value of the streaming encoder can be quickly determined.
In another possible implementation manner of the second aspect, the method further includes:
A connectionist temporal classification (CTC) loss value of the streaming encoder is determined based on the streaming coding feature of the audio sample, and the first model loss value is determined based on the CTC loss value of the streaming encoder and the RNNT loss value of the streaming encoder. That is, the first model loss value of the streaming encoder is determined comprehensively by combining the CTC loss value of the streaming encoder with the RNNT loss value of the streaming encoder, which increases the amount of information referenced in determining the model loss value and improves the accuracy of the determined model loss value.
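For illustration, the combination of the two loss values for the streaming branch could be computed as in the following sketch. Using torchaudio's rnnt_loss and PyTorch's CTCLoss, the stand-in shapes, and the 0.3 weight are assumptions made here, not choices specified by the application.

```python
import torch
import torchaudio.functional as AF

# Stand-in shapes: batch 2, 50 encoder frames, 10 target tokens, vocabulary 100, blank id 0.
logits = torch.randn(2, 50, 11, 100)                          # joint-network output over the (frame, token) grid
targets = torch.randint(1, 100, (2, 10), dtype=torch.int32)   # labeling text as token ids (no blanks)
logit_lengths = torch.tensor([50, 45], dtype=torch.int32)
target_lengths = torch.tensor([10, 8], dtype=torch.int32)

# RNNT loss value of the streaming encoder branch.
rnnt_loss = AF.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)

# CTC loss value computed from the streaming coding features (here a stand-in for the
# output of an assumed auxiliary projection layer over those features).
ctc_log_probs = torch.randn(50, 2, 100).log_softmax(dim=-1)   # (frames, batch, vocabulary)
ctc_loss = torch.nn.CTCLoss(blank=0)(ctc_log_probs, targets.long(),
                                     logit_lengths.long(), target_lengths.long())

# First model loss value of the streaming encoder: a combination of the two (weight assumed).
first_model_loss = rnnt_loss + 0.3 * ctc_loss
```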
In another possible implementation manner of the second aspect, the obtaining, by the streaming encoder and the non-streaming encoder of the model, the second recognition result of the audio sample includes:
and encoding the audio sample by the stream encoder to obtain the stream encoding characteristic of the audio sample. That is, in the model training stage, the audio samples may be directly input into the streaming encoder, and then the streaming encoder encodes each audio frame in the audio samples frame by frame to obtain the streaming encoding characteristics of the audio samples. Wherein the stream coding feature is used to characterize the audio characteristics of each audio frame in the audio sample.
And, encoding the streaming coding feature of the audio sample by the non-streaming encoder to obtain the non-streaming coding feature of the audio sample. That is, the streaming coding feature of the audio sample is input to the non-streaming coder, and the streaming coding feature of the audio sample is coded to obtain the non-streaming coding feature of the audio sample. Wherein the non-streaming coding feature is used to characterize the audio characteristics of the audio sample.
Further, a second recognition result of the audio sample is determined based on the non-streaming coding feature of the audio sample and the text prediction result of the audio sample. The text prediction result is predicted based on the speech recognition result of the audio frame preceding each audio frame in the audio sample. That is, the second recognition result of the audio sample is determined by combining the text prediction result of the audio sample with the non-streaming coding feature of the audio sample, so that a model that performs speech recognition based on the non-streaming coding feature combined with the text prediction result can be trained.
In another possible implementation manner of the second aspect, determining the second model loss value of the non-streaming encoder based on the second recognition result and the labeling text includes:
And determining an RNNT loss value of the non-streaming encoder based on the second recognition result and the labeling text, as the second model loss value. Thus, by determining the RNNT loss value of the non-streaming encoder, the second model loss value of the non-streaming encoder can be quickly determined.
In another possible implementation manner of the second aspect, the method further includes:
a CTC loss value for the non-streaming encoder is determined based on the non-streaming encoding characteristics of the audio samples. The second model loss value is determined based on the CTC loss value of the non-streaming encoder and the RNNT loss value of the non-streaming encoder. That is, combining the CTC loss value of the non-streaming encoder on the basis of the RNNT loss value of the non-streaming encoder to comprehensively determine the second model loss value of the non-streaming encoder increases the amount of information referenced for determining the model loss value, and improves the accuracy of determining the model loss value.
In another possible implementation manner of the second aspect, adjusting the model parameters based on the first model loss value and the second model loss value includes:
and carrying out weighted summation processing on the first model loss value and the second model loss value to obtain a model training loss value; and adjusting the model parameters based on the model training loss value. That is, the model parameters are adjusted by combining the model loss value of the streaming encoder with the model loss value of the non-streaming encoder.
In this way, the first model loss value can represent the difference between the speech recognition result output by the streaming encoder and the labeling text, and the second model loss value can represent the difference between the speech recognition result output by the non-streaming encoder and the labeling text, so that the model parameters are adjusted according to the first model loss value and the second model loss value, the learning capacity of the model can be improved, and the speech recognition model with better learning capacity can be trained.
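A minimal, self-contained sketch of the weighted summation and the resulting gradient step is shown below; the 0.7/0.3 weights and the stand-in loss values are assumptions for illustration. In a full training loop, an optimizer step would follow to actually adjust the model parameters.

```python
import torch

# Stand-in model loss values; in training they come from the streaming branch and the
# non-streaming branch respectively (requires_grad makes the snippet runnable on its own).
first_model_loss = torch.tensor(2.3, requires_grad=True)
second_model_loss = torch.tensor(1.7, requires_grad=True)

# Weighted summation of the two model loss values; the 0.7/0.3 weights are assumptions,
# the application only specifies that a weighted sum is used as the training loss.
model_training_loss = 0.7 * first_model_loss + 0.3 * second_model_loss
model_training_loss.backward()   # the resulting gradients drive the model parameter adjustment
```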
In a third aspect, the present application provides an electronic device, comprising: a processor and a memory. A voice recognition model associated with the electronic device is stored in the memory, the voice recognition model comprising a streaming encoder and a non-streaming encoder; the electronic equipment belongs to a first equipment type or a second equipment type, and the computing power deployment capability of the first equipment type is lower than that of the second equipment type; the memory is for storing program code and the processor is for invoking the program code stored in the memory to implement any one of the methods provided in the first or second aspects.
In a fourth aspect, there is provided a computer readable storage medium comprising program code which, when run on an electronic device, causes the electronic device to perform any one of the methods provided in the first or second aspects.
In a fifth aspect, there is provided a computer program product comprising program code which, when run on an electronic device, causes the electronic device to perform any one of the methods provided in the first or second aspects.
It should be noted that, the technical effects caused by any implementation manner of the third aspect to the fifth aspect may refer to the technical effects caused by the corresponding implementation manner of the first aspect or the second aspect, and are not described herein.
Drawings
FIG. 1 is a schematic diagram of a speech recognition model according to the related art;
FIG. 2 is a schematic diagram of an electronic device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a software structure of an electronic device according to an embodiment of the present application;
FIG. 5 is a flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 6 is a flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a speech recognition model according to an embodiment of the present application;
FIG. 8 is a schematic framework diagram of a downsampling module according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of a downsampling process and an upsampling process according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a speech recognition mode according to an embodiment of the present application;
FIG. 11 is a flowchart of a training method of a speech recognition model according to an embodiment of the present application;
FIG. 12 is a flowchart of a training method of a speech recognition model according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a model framework during a model training phase according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a speech recognition device according to an embodiment of the present application;
FIG. 15 is a schematic framework diagram of a training device for a speech recognition model according to an embodiment of the present application.
Detailed Description
In the description of the present application, "/" means "or" unless otherwise indicated; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. Furthermore, "at least one" means one or more, and "a plurality" means two or more. The terms "first", "second", and the like do not limit the number or the order of execution, and objects described as "first" and "second" are not necessarily different.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
The voice recognition method provided by the embodiment of the application can be applied to voice recognition scenes of electronic equipment, such as intelligent conversation, voice assistant, conference record and other scenes of the electronic equipment.
Speech recognition is a technique for converting speech into corresponding text. Currently, an electronic device is typically deployed with a speech recognition system that provides a speech recognition model with a single, unified architecture. Accordingly, when the speech recognition function of the electronic device is triggered, speech recognition is performed using this unified-architecture model.
By way of example, the speech recognition system may be a Transducer-based speech recognition system in which a speech recognition model as shown in fig. 1 may be provided. Fig. 1 is a schematic diagram of a speech recognition model according to the related art. Referring to fig. 1, the speech recognition model may include three modules: encoder, prediction network, and fusion network.
The encoder is configured to process the feature sequence of the audio to extract valid features from it. In general, the encoder may be provided with a relatively large number of model parameters to implement this feature-extraction function. For example, the encoder may employ a Transformer-based architecture or a Conformer-based architecture, both of which rely on an attention mechanism to process the feature sequence of the audio. The prediction network is used to predict future text based on the historical speech recognition results of the audio (i.e., the already recognized text). The fusion network provides a feature fusion function and is used to fuse the output of the encoder with the output of the prediction network to obtain the final speech recognition result.
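Schematically, the three modules compose as in the following sketch. This is not the model of FIG. 1 itself; the layer types, dimensions, and vocabulary size are illustrative assumptions introduced here only to show how an encoder, a prediction network, and a fusion network fit together in a Transducer-style model.

```python
import torch
import torch.nn as nn

class TransducerSketch(nn.Module):
    """Schematic Transducer composition: encoder + prediction network + fusion (joint) network."""
    def __init__(self, feat_dim=80, vocab_size=5000, dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, dim, batch_first=True)        # processes the audio feature sequence
        self.embed = nn.Embedding(vocab_size, dim)
        self.prediction = nn.LSTM(dim, dim, batch_first=True)          # predicts from recognized text history
        self.joint = nn.Linear(2 * dim, vocab_size)                    # fusion network

    def forward(self, audio_feats, text_history):
        enc, _ = self.encoder(audio_feats)                    # (batch, T, dim)
        pred, _ = self.prediction(self.embed(text_history))   # (batch, U, dim)
        # Fuse every (frame, text) pair and map to output probabilities.
        enc = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
        pred = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([enc, pred], dim=-1)).log_softmax(dim=-1)

model = TransducerSketch()
audio_feats = torch.randn(1, 20, 80)                 # stand-in audio feature sequence
text_history = torch.randint(0, 5000, (1, 5))        # stand-in recognized text history
out = model(audio_feats, text_history)                # (1, 20, 5, 5000)
```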
However, a speech recognition model with a unified architecture cannot be well adapted to electronic devices with different computing power deployment capabilities. It should be appreciated that, because different electronic devices use different central processing units (central processing unit, CPU) or graphics processing units (graphics processing unit, GPU), their computing power deployment capabilities also differ. For example, compared with the large cores used by a high-end device, the medium or small cores used by a low-end device have lower computing power deployment capability, so the speech recognition latency of the low-end device increases significantly compared with the high-end device, which greatly affects the user experience.
In view of this, the embodiment of the application provides a voice recognition method and a voice recognition model with both a streaming encoder and a non-streaming encoder. For electronic devices of a first device type, voice recognition may be performed on an audio stream received by the electronic device based on a stream encoder. For electronic devices of the second device type, the audio streams received by the electronic devices may be speech-recognized in conjunction with a streaming encoder and a non-streaming encoder. Wherein the computing power deployment capability of the first device type is lower than the computing power deployment capability of the second device type.
That is, for an electronic device with low computing power deployment capability, speech recognition may be performed on the audio stream received by the electronic device based on the streaming encoder alone; for an electronic device with high computing power deployment capability, the audio stream received by the electronic device may be recognized using the streaming encoder in combination with the non-streaming encoder. In this way, electronic devices with different computing power deployment capabilities can all be accommodated. Because the streaming encoder performs speech recognition frame by frame, it operates at the granularity of individual audio frames and therefore consumes fewer computing resources. When the method is applied to an electronic device with low computing power deployment capability, performing speech recognition based on the streaming encoder effectively reduces the computing resources consumed by speech recognition, reduces the latency of speech recognition, and improves the efficiency of speech recognition. When the method is applied to an electronic device with high computing power deployment capability, the device's higher computing power allows the streaming encoder to be combined with the non-streaming encoder for speech recognition, which ensures the efficiency of speech recognition while also improving its accuracy.
In one possible implementation, the voice recognition method provided in the embodiment of the present application may be applied to the electronic device 200 shown in fig. 2. Fig. 2 is a schematic diagram of an electronic device according to an embodiment of the present application.
The electronic device 200 may be a terminal device. By way of example, the electronic device 200 may be at least one of a smart phone, a smart watch, a desktop computer, a laptop computer, a virtual reality terminal, an augmented reality terminal, a wireless terminal, and the like.
In an embodiment of the present application, the electronic device 200 is provided with a voice recognition function. Referring to fig. 2, by performing an operation on the electronic device 200, a user can trigger the electronic device 200 to receive an audio stream (such as voice) input by the user, and further perform voice recognition on the audio stream input by the user, so as to obtain a voice recognition result of the audio stream. Further, the voice recognition result of the audio stream may be displayed on the display screen of the electronic device 200.
In the embodiment of the present application, the electronic device 200 belongs to the first device type, and the electronic device 200 is configured to receive an audio stream input by a user and perform speech recognition on the audio stream based on the streaming encoder of the speech recognition model to obtain a speech recognition result of the audio stream. Alternatively, the electronic device 200 belongs to the second device type, and the electronic device 200 is configured to receive an audio stream input by a user and perform speech recognition on the audio stream based on the streaming encoder and the non-streaming encoder of the speech recognition model to obtain a speech recognition result of the audio stream.
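The device-type-dependent choice of recognition path can be sketched as follows; the enumeration names and function parameters are hypothetical and introduced only to illustrate the dispatch described above.

```python
from enum import Enum

class DeviceType(Enum):
    FIRST = "lower computing power deployment capability"
    SECOND = "higher computing power deployment capability"

def recognize(audio_stream, device_type, streaming_recognize, two_pass_recognize):
    """Route the received audio stream to the recognition path matching the device type."""
    if device_type is DeviceType.FIRST:
        # First device type: streaming encoder only, for lower latency and lower compute.
        return streaming_recognize(audio_stream)
    # Second device type: streaming encoder plus non-streaming encoder, for higher accuracy.
    return two_pass_recognize(audio_stream)
```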
Exemplary, a schematic structural diagram of the electronic device 200 in fig. 2 is shown in fig. 3. Fig. 3 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Referring to fig. 3, the electronic device may include a processor 310, an external memory interface 320, an internal memory 321, a universal serial bus (universal serial bus, USB) interface 330, a charge management module 340, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 370D, a sensor module 380, keys 390, a display 391, and the like. Wherein the sensor module 380 may include a pressure sensor 380A, a touch sensor 380B, etc.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device. In other embodiments of the application, the electronic device may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 310 may include one or more processing units, such as: the processor 310 may include an application processor (application processor, AP), a modem processor, a graphics processor, an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural Network Processor (NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller can be a neural center and a command center of the electronic device. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 310 for storing instructions and data. In some embodiments, the memory in the processor 310 is a cache. The memory may hold instructions or data that the processor 310 has just used or uses cyclically. If the processor 310 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 310, and thereby improves system efficiency.
In some embodiments, processor 310 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
Wherein the I2S interface may be used for audio communication. In some embodiments, the processor 310 may contain multiple sets of I2S buses. The processor 310 may be coupled to the audio module 370 via an I2S bus to enable communication between the processor 310 and the audio module 370. For example, in an embodiment of the present application, processor 310 may send instructions to audio module 370 via the I2S bus to trigger audio module 370 to receive an audio stream entered by a user. As another example, in an embodiment of the present application, the audio module 370 may send the received audio stream to the processor 310 through the I2S bus to trigger the processor 310 to perform the voice recognition method provided by the embodiment of the present application.
PCM interfaces may also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 370 and the wireless communication module 360 may be coupled by a PCM bus interface.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 310 with the wireless communication module 360.
The MIPI interface may be used to connect the processor 310 to peripheral devices such as the display 391. The MIPI interface may include a display screen serial interface (display serial interface, DSI) or the like. In some embodiments, processor 310 and display 391 communicate via a DSI interface to implement display functionality of an electronic device. For example, in an embodiment of the present application, the processor 310 may send an instruction to the display 391 through the DSI interface to trigger the display 391 to display the speech recognition result obtained through the speech recognition.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal or as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 310 with the display 391, wireless communication module 360, audio module 370, sensor module 380, and the like.
The USB interface 330 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like.
It should be understood that the connection relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device. In other embodiments of the present application, the electronic device may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The charge management module 340 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger.
The wireless communication function of the electronic device may be implemented by the antenna 1, the antenna 2, the mobile communication module 350, the wireless communication module 360, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device may be used to cover a single or multiple communication bands. The mobile communication module 350 may provide a solution for wireless communication including 2G/3G/4G/5G, etc. applied on an electronic device. The wireless communication module 360 may provide a solution for wireless communication applied on an electronic device.
In some embodiments, the antenna 1 and the mobile communication module 350 of the electronic device are coupled, and the antenna 2 and the wireless communication module 360 are coupled, so that the electronic device can communicate with the network and other devices through wireless communication technology.
The electronic device implements display functions through a GPU, a display 391, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 391 and the application processor.
The display 391 is used to display images, videos, and the like. The display 391 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light emitting diode, AMOLED), a flexible light-emitting diode (flexible light-emitting diode, FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (quantum dot light emitting diodes, QLED), or the like. In some embodiments, the electronic device may include 1 or N display screens 391, where N is a positive integer greater than 1. For example, in the embodiment of the present application, the display 391 is used to display the speech recognition result obtained through speech recognition.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example, the transmission mode between neurons in the human brain, it can rapidly process input information and can also continuously learn by itself. Applications such as intelligent cognition of the electronic device, for example, image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU. For example, in the embodiment of the present application, the speech recognition method provided by the embodiment of the present application may be implemented by running the speech recognition model on the NPU.
The external memory interface 320 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device. The external memory card communicates with the processor 310 through an external memory interface 320 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 321 may be used to store computer executable program code comprising instructions. The processor 310 executes various functional applications of the electronic device and data processing by executing instructions stored in the internal memory 321. The internal memory 321 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device (e.g., audio data, phonebook, etc.), and so forth. In addition, the internal memory 321 may include a high-speed random access memory, and may also include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The electronic device may implement audio functions, such as music playing and recording, through the audio module 370, the speaker 370A, the receiver 370B, the microphone 370C, the earphone interface 370D, the application processor, and the like.
The audio module 370 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 370 may also be used to encode and decode audio signals. In some embodiments, the audio module 370 may be disposed in the processor 310, or some of the functional modules of the audio module 370 may be disposed in the processor 310.
Speaker 370A, also known as a "horn", is used to convert an audio electrical signal into a sound signal. The electronic device may play music or take a hands-free call through the speaker 370A. Receiver 370B, also referred to as an "earpiece", is used to convert an audio electrical signal into a sound signal. When the electronic device answers a call or receives a voice message, the voice can be heard by placing the receiver 370B close to the ear. Microphone 370C, also referred to as a "mic", is used to convert a sound signal into an electrical signal. When making a call or sending voice information, the user can speak close to the microphone 370C to input a sound signal into the microphone 370C. The electronic device may be provided with at least one microphone 370C. In other embodiments, the electronic device may be provided with two microphones 370C, which may also perform a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device may also be provided with three, four, or more microphones 370C to implement sound signal collection, noise reduction, sound source identification, directional recording, and the like. The earphone interface 370D is used to connect a wired earphone. The earphone interface 370D may be the USB interface 330, or may be a 3.5 mm open mobile terminal platform (open mobile terminal platform, OMTP) standard interface or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The pressure sensor 380A is configured to sense a pressure signal and convert the pressure signal into an electrical signal. In some embodiments, pressure sensor 380A may be disposed on display 391. The pressure sensor 380A is of various types, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like.
The touch sensor 380B, also referred to as a "touch panel". The touch sensor 380B may be disposed on the display 391, and the touch sensor 380B and the display 391 form a touch screen, which is also referred to as a "touch screen". The touch sensor 380B is for detecting a touch operation acting thereon or thereabout. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display 391. In other embodiments, the touch sensor 380B may also be disposed on a surface of the electronic device other than where the display 391 is located.
It should be noted that the structure shown in fig. 3 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than those shown in fig. 3, may combine some components, or may have a different arrangement of components.
The software system of the electronic device may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android system with a layered architecture is taken as an example, and the software structure of the electronic equipment is illustrated. Fig. 4 is a schematic software structure of an electronic device according to an embodiment of the present application.
The layered architecture divides the software into several layers, each with its own role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime (Android runtime) and system libraries, and the kernel layer.
The application layer may include a series of application packages. As shown in fig. 4, the application packages may include applications such as camera, gallery, calendar, calls, maps, navigation, WLAN, Bluetooth, music, video, and messaging.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.
As shown in fig. 4, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
In the embodiment of the application, the man-machine interaction between the user and the electronic equipment is realized through the application program layer and the application program framework layer. For example, the electronic device implements a process of receiving an audio stream input by a user through the application layer and the application framework layer.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., openGL ES), two-dimensional graphics engines (e.g., SGL), etc.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
Fig. 5 is a flowchart of a voice recognition method according to an embodiment of the present application. Referring to fig. 5, the method is applied to an electronic device associated with a speech recognition model including a streaming encoder and a non-streaming encoder. Wherein the electronic device is of a first device type; the method comprises the following steps S501-S502:
S501, the electronic equipment receives an audio stream input by a user.
S502, performing voice recognition on the audio stream based on the stream encoder to obtain a voice recognition result of the audio stream.
Or, the electronic device is of a second device type; the method includes the following S503-S504:
S503, the electronic equipment receives the audio stream input by the user.
S504, performing voice recognition on the audio stream based on the stream encoder and the non-stream encoder to obtain a voice recognition result of the audio stream.
Wherein the computing power deployment capability of the first device type is lower than the computing power deployment capability of the second device type.
The technical scheme provided by the embodiment of the application provides a speech recognition model having both a streaming encoder and a non-streaming encoder. For an electronic device of the first device type, speech recognition may be performed on the audio stream received by the electronic device based on the streaming encoder. For an electronic device of the second device type, speech recognition may be performed on the audio stream received by the electronic device by combining the streaming encoder and the non-streaming encoder. The computing power deployment capability of the first device type is lower than that of the second device type. That is, for an electronic device with low computing power deployment capability, speech recognition may be performed on the received audio stream based on the streaming encoder alone; for an electronic device with high computing power deployment capability, speech recognition may be performed on the received audio stream by combining the streaming encoder and the non-streaming encoder. In this way, electronic devices with different computing power deployment capabilities can all be accommodated. Because the streaming encoder performs speech recognition frame by frame, it consumes fewer computational resources. Therefore, when the method is applied to an electronic device with low computing power deployment capability, performing speech recognition based on the streaming encoder can effectively reduce the computational resources consumed by speech recognition and reduce its latency, thereby improving the efficiency of speech recognition. When the method is applied to an electronic device with high computing power deployment capability, the higher computing power allows the streaming encoder to be combined with the non-streaming encoder for speech recognition, which not only maintains the efficiency of speech recognition but also improves its accuracy.
Fig. 6 is a flowchart of a voice recognition method according to an embodiment of the present application. Referring to fig. 6, the two scenarios of an electronic device of the first device type and an electronic device of the second device type are described by way of example. The computing power deployment capability of the first device type is lower than that of the second device type. For example, the first device type may be a device type with low computing power deployment capability, such as a low-end device, and the second device type may be a device type with high computing power deployment capability, such as a high-end device. It should be noted that the device type of the electronic device may be determined based on the type of central processor or graphics processor used by the electronic device.
In a scenario where the electronic device belongs to a first device type, the method comprises the following S601-S605:
s601, the electronic equipment receives an audio stream input by a user.
In some embodiments, the electronic device may be provided with a speech recognition control for triggering speech recognition of the audio stream input by the user. For example, the electronic device receives a triggering operation of the voice recognition control by a user, and receives an audio stream input by the user through a microphone built in the electronic device in response to the triggering operation of the voice recognition control.
Alternatively, in other embodiments, the electronic device may be provided with a wake-up word that turns on the speech recognition function, which is used to trigger speech recognition of the audio stream entered by the user. For example, the electronic device receives a wake-up instruction from a user, and receives an audio stream input by the user through a microphone built in the electronic device in response to the wake-up word carried in the wake-up instruction being the same as the set wake-up word.
It should be understood that the audio stream may be a piece of voice input by the user to the electronic device, or may be a piece of pre-recorded audio.
In the embodiment of the application, a voice recognition model with both a streaming encoder and a non-streaming encoder is provided, and the voice recognition model can be compatible with electronic equipment with different equipment types (namely different computing power deployment capabilities), namely, the voice recognition model can be suitable for electronic equipment with low computing power deployment capabilities and electronic equipment with high computing power deployment capabilities.
Fig. 7 is a schematic diagram of a speech recognition model according to an embodiment of the present application. Referring to fig. 7, a framework of a speech recognition model employing a concatenated encoder comprising a streaming encoder and a non-streaming encoder (or referred to as a high-latency encoder), a prediction network, and a fusion network is shown.
The streaming encoder is configured to process the feature sequence of the audio frames input in real time and extract valid features from it. The output of the streaming encoder may be used as the input of the non-streaming encoder, which processes the output of the streaming encoder to further extract valid features. The prediction network is used to predict future text based on the historical speech recognition results of the audio (i.e., the text already recognized). Both the output of the streaming encoder and the output of the non-streaming encoder are fed into the fusion network, which has a feature fusion function and fuses the output of the cascaded encoder with the output of the prediction network to obtain the final speech recognition result.
In the speech recognition model shown in fig. 7, the streaming encoder shares the prediction network and the fusion network with the non-streaming encoder, so that the number of parameters of the speech recognition model can be effectively reduced, and the computation amount and the power consumption of the speech recognition model can be reduced. The following describes a speech recognition flow based on the speech recognition model shown in fig. 7.
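As an aid to understanding, the following is a minimal PyTorch-style sketch of how the cascaded encoder (streaming encoder plus non-streaming encoder), the prediction network, and the shared fusion network of fig. 7 might be assembled. The layer types, dimensions, and module names are illustrative assumptions for this sketch and do not represent the actual implementation of the speech recognition model.

```python
import torch
import torch.nn as nn

class CascadedTransducer(nn.Module):
    """Sketch of the model in fig. 7: a streaming encoder, a non-streaming
    (high-latency) encoder stacked on its output, a prediction network, and a
    fusion (joiner) network shared by both recognition modes. Sizes are
    illustrative assumptions."""

    def __init__(self, feat_dim=80, hidden=256, vocab=4096):
        super().__init__()
        # Streaming encoder: processes audio-frame features as they arrive.
        self.streaming_encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        # Non-streaming encoder: consumes the streaming encoder's output.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.non_streaming_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Prediction network: predicts future text from already recognized text.
        self.embed = nn.Embedding(vocab, hidden)
        self.prediction_network = nn.LSTM(hidden, hidden, batch_first=True)
        # Fusion network: fuses encoder output with prediction-network output.
        self.fusion = nn.Linear(hidden, vocab)

    def joint(self, enc, pred):
        # Additive fusion of encoder features (B, T, H) and text features (B, U, H).
        return self.fusion(torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1)))

    def forward(self, feats, tokens, streaming_only=True):
        h_s, _ = self.streaming_encoder(feats)             # streaming coding features
        pred, _ = self.prediction_network(self.embed(tokens))
        if streaming_only:                                  # first device type
            return self.joint(h_s, pred)
        h_ns = self.non_streaming_encoder(h_s)              # second device type
        return self.joint(h_ns, pred)
```

Under this sketch, an electronic device of the first device type would call the model with streaming_only=True, while an electronic device of the second device type would pass streaming_only=False; both paths share the same prediction network and fusion network, which is what keeps the parameter count down.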
S602, the electronic device inputs the audio stream into the streaming encoder frame by frame, and the streaming encoder encodes the input audio frames to obtain the streaming encoding features of the audio frames.
Wherein the streaming coding feature is used to characterize the audio feature of the audio frame. It should be understood that streaming refers to speech recognition while inputting audio. For example, in some embodiments, in receiving an audio stream input by a user, the audio stream is input to the streaming encoder frame by frame, i.e., each time an audio frame is received, the audio frame is input to the streaming encoder. Further, the stream encoder encodes the input audio frame to obtain a stream coding feature of the audio frame.
In some embodiments, the speech recognition model further comprises a downsampling module for downsampling the output characteristics of at least one network layer of the streaming encoder or the non-streaming encoder.
In a process of performing speech recognition on an audio stream based on a streaming encoder, an output feature of at least one network layer in the streaming encoder may be downsampled by the downsampling module. Wherein the at least one network layer may be an input layer or an intermediate layer.
For example, in some embodiments, the output characteristics of each network layer in the streaming encoder may be downsampled by the downsampling module. As another example, in other embodiments, the downsampling module may downsample the output characteristics of a portion of the network layers in the streaming encoder, such as by performing the downsampling process once per a predetermined number of network layers. The preset number is a preset number, such as 2, 5 or other values.
Therefore, the length of the output characteristic in the stream encoder can be effectively shortened, the operation amount of the stream encoder is reduced, and the operation amount of a voice recognition model is further reduced, so that the voice recognition efficiency can be further improved.
For example, in some embodiments, the downsampling module may include a first convolution module and a second convolution module. Wherein the first convolution module is configured to compress a feature sequence length of the output feature. The second convolution module is used for fusing information among different channels. Accordingly, the downsampling process may be: and carrying out first convolution processing on the output characteristic through the first convolution module to obtain the output characteristic after the first convolution processing. And performing second convolution processing on the output characteristics after the first convolution processing through the second convolution module to obtain the output characteristics after the second convolution processing, wherein the output characteristics after the second convolution processing are used as the output characteristics after the downsampling processing.
By way of example, the first convolution module may be a depth-wise convolution module, for example one with a convolution kernel (kernel) size of 5 and a stride (stride) of 2. This provides a first convolution module with a small parameter count and a small amount of computation, and it halves the feature sequence length of the output feature.
For example, taking a depth-wise convolution module with a convolution kernel size of 5 and a stride of 2 as an example, assume the output feature is [h_1, h_2, …, h_T], where T represents the number of frames and T is a positive integer greater than 2. The output feature [h_1, h_2, …, h_T] is input into the depth-wise convolution module, which performs depth-wise convolution on the output feature to obtain a compressed feature sequence [h_1, h_2, …, h_{T/2}] whose feature sequence length is halved, that is, the output feature after the first convolution processing.
Also by way of example, the second convolution module may be a point-wise convolution module. The output feature after the first convolution processing, i.e. the compressed feature sequence [h_1, h_2, …, h_{T/2}], is input into the point-wise convolution module, which performs point-wise convolution on the compressed feature sequence [h_1, h_2, …, h_{T/2}] to obtain a compressed feature sequence [h_1, h_2, …, h_{T/2}] in which information between different channels has been fused, that is, the output feature after the downsampling processing.
Fig. 8 is a schematic diagram of a frame of a downsampling module according to an embodiment of the present application. Referring to fig. 8, the first convolution module may be a one-dimensional depth-wise convolution module shown in fig. 8, and the second convolution module may be a one-dimensional point-wise convolution module shown in fig. 8. The output characteristic may be a characteristic sequence 801 shown in fig. 8, and after the one-dimensional depth-wise convolution module and the one-dimensional point-wise convolution module perform downsampling processing on the characteristic sequence 801, a final compressed characteristic sequence 802 may be obtained.
Thus, a feature sequence compressed downsampling module is provided. Considering that a considerable information redundancy exists between adjacent audio frames, the first convolution module of the downsampling module can compress the characteristic sequence length of the output characteristics of the audio frames in the network layer, so that the length of the output characteristics can be effectively shortened, the operation amount of a voice recognition model is reduced, and the voice recognition efficiency can be further improved. The second convolution module of the downsampling module can fuse information among different channels, so that the information quantity referenced by voice recognition can be increased, and the accuracy of the voice recognition is improved.
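A minimal sketch of such a downsampling module, assuming one-dimensional convolutions with the kernel size 5 and stride 2 mentioned above (the padding and channel count are additional assumptions), might look as follows.

```python
import torch
import torch.nn as nn

class DownsampleModule(nn.Module):
    """Sketch of the downsampling module in fig. 8: a one-dimensional depth-wise
    convolution that halves the feature sequence length, followed by a
    one-dimensional point-wise convolution that fuses information between channels."""

    def __init__(self, channels=256):
        super().__init__()
        # First convolution module: depth-wise, kernel 5, stride 2 (halves T).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size=5, stride=2,
                                   padding=2, groups=channels)
        # Second convolution module: point-wise, mixes information across channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (B, T, C) output features of a network layer.
        x = x.transpose(1, 2)        # (B, C, T) layout expected by Conv1d
        x = self.depthwise(x)        # (B, C, T/2) compressed feature sequence
        x = self.pointwise(x)        # (B, C, T/2) channel-fused features
        return x.transpose(1, 2)     # back to (B, T/2, C)

# For example, a 12-frame feature sequence is compressed to 6 frames.
print(DownsampleModule(64)(torch.randn(1, 12, 64)).shape)  # torch.Size([1, 6, 64])
```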
S603, determining a first output probability distribution of the audio frame based on the stream coding feature of the audio frame and a text prediction result of the audio frame.
The text prediction result is predicted based on the speech recognition result of the audio frame preceding the current audio frame. In some embodiments, the electronic device may input the speech recognition result of the preceding audio frame into the prediction network of the speech recognition model, and the prediction network predicts the current audio frame based on that result to obtain the text prediction result of the audio frame. Therefore, determining the first output probability distribution of the audio frame by combining the streaming encoding feature of the audio frame with its text prediction result can improve the accuracy of speech recognition. It will be appreciated that if the current audio frame is the first audio frame of the audio stream, a default text may be input into the prediction network of the speech recognition model to obtain the text prediction result of that audio frame. For example, the default text may be text that occurs frequently in speech recognition results, such as hotwords.
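As an illustration of this step, the prediction network can be sketched as an embedding layer followed by a small recurrent network; the particular layers and the use of a fixed default token for the first frame are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Sketch: predict a text feature for the current audio frame from the
    speech recognition result (token) of the preceding audio frame."""

    def __init__(self, vocab=4096, hidden=256, default_token=0):
        super().__init__()
        self.default_token = default_token      # stands in for the default text
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, prev_token=None, state=None):
        if prev_token is None:                  # first audio frame of the audio stream
            prev_token = torch.tensor([[self.default_token]])
        out, state = self.rnn(self.embed(prev_token), state)
        return out, state                        # text prediction feature and updated state
```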
The first output probability distribution represents a probability distribution of the audio frame over a plurality of candidate texts corresponding to the text prediction result. For example, the first output probability distribution may include probability values for a plurality of candidate texts corresponding to the audio frame. Taking the audio frame M as an example, the corresponding plurality of candidate texts may be M1 (probability value 0.9), M2 (probability value 0.7), M3 (probability value 0.3), and so on. It should be appreciated that if the probability value of an audio frame over a candidate text is large, it is indicated that the audio frame is likely to correspond to the candidate text.
In some embodiments, after the stream coding feature of the audio frame and the text prediction result of the audio frame are obtained, the stream coding feature of the audio frame and the text prediction result of the audio frame are input into the fusion network, and fusion processing is performed on the stream coding feature of the audio frame and the text prediction result of the audio frame through the fusion network, so as to obtain a first output probability distribution of the audio frame. The fusion process may be an accumulation process.
Therefore, the fusion network of the voice recognition model is utilized to fuse the streaming coding feature of the audio frame and the text prediction result of the audio frame so as to comprehensively reference the streaming coding feature of the audio frame and the text prediction result of the audio frame, thereby not only ensuring the efficiency of voice recognition, but also improving the accuracy of voice recognition.
It should be noted that, the electronic device may first obtain the streaming coding feature of the audio frame, and then obtain the text prediction result of the audio frame. Alternatively, the electronic device may obtain the text prediction result of the audio frame first, and then obtain the stream coding feature of the audio frame. Alternatively, the electronic device may also synchronously obtain the streaming coding feature of the audio frame and the text prediction result of the audio frame. The embodiment of the application does not limit the sequence of acquiring the stream coding characteristics of the audio frame and the text prediction result of the audio frame.
S604, decoding the first output probability distribution of the audio frame to obtain a voice recognition result of the audio frame.
The speech recognition result of the audio frame may be a text recognition result, such as a word, corresponding to the audio frame. Thus, the decoding based on the audio frame is realized, and the operation amount and time complexity of the decoding can be effectively reduced.
In the above embodiments, because the streaming encoder performs speech recognition frame by frame, it consumes fewer computational resources. Therefore, when the method is applied to an electronic device with low computing power deployment capability, performing speech recognition based on the streaming encoder can effectively reduce the computational resources consumed by speech recognition and reduce its latency, thereby improving the efficiency of speech recognition.
S605, displaying a voice recognition result of the audio frame on the electronic equipment.
Therefore, the voice recognition result of the recognized audio frame can be displayed in real time in the process of inputting the audio stream by the user, the real-time display based on the frame recognition result is realized, and the man-machine interaction efficiency is improved.
In the embodiment shown in S601-S605, in the scenario where the electronic device belongs to the first device type, speech recognition is performed on the audio stream based on the streaming encoder to obtain the speech recognition result of the audio stream. It should be noted that S601-S605 describe the speech recognition process of one audio frame; it is understood that after the speech recognition process of that audio frame is completed, the speech recognition process of the next audio frame continues until all audio frames of the audio stream have been recognized. Further, a word-by-word display effect may be presented to the user on the electronic device.
In one example, in the process of receiving the audio stream [x_1, x_2, …, x_T] input by the user, the electronic device inputs a frame of audio x_1 into the streaming encoder, and the streaming encoder encodes the frame of audio x_1 to obtain its streaming encoding feature h_1^s. The text prediction result of the frame of audio x_1 is obtained through the prediction network. Further, the streaming encoding feature h_1^s of the frame of audio x_1 and the text prediction result of the frame of audio x_1 are input into the fusion network, and the output probability distribution of the frame of audio x_1 is calculated through the fusion network. The output probability distribution of the frame of audio x_1 is then decoded to obtain the speech recognition result of the frame of audio x_1, and the speech recognition result is displayed on the electronic device.
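The frame-by-frame flow of S601-S605 can be summarized by the following sketch, which reuses the CascadedTransducer sketch above; the greedy argmax decoding and the blank-token handling are assumptions made for illustration, since the embodiment does not prescribe a particular decoding algorithm.

```python
import torch

@torch.no_grad()
def streaming_recognize(model, audio_frames, blank_id=0):
    """Greedy frame-by-frame recognition with the streaming encoder only
    (first device type). model is the CascadedTransducer sketch above."""
    hyp = []                                          # recognized text so far
    prev = torch.tensor([[blank_id]])                 # default text for the first frame
    enc_state, pred_state = None, None
    for frame in audio_frames:                        # frame: (1, 1, feat_dim)
        # S602: encode the input audio frame to get its streaming coding feature.
        h_s, enc_state = model.streaming_encoder(frame, enc_state)
        # Text prediction based on the previous recognition result.
        pred, pred_state = model.prediction_network(model.embed(prev), pred_state)
        # S603: fuse into the first output probability distribution.
        dist = model.joint(h_s, pred)
        # S604: decode (greedily) to get the recognition result of this frame.
        token = dist.argmax(dim=-1).item()
        if token != blank_id:
            hyp.append(token)                         # S605: display in real time
            prev = torch.tensor([[token]])
    return hyp
```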
In a scenario where the electronic device belongs to the second device type, the method comprises the following S606-S611:
s606, the electronic equipment receives the audio stream input by the user.
It should be noted that, the content of S606 may be referred to S601, and will not be described again.
S607, the electronic device inputs the audio stream into the streaming encoder frame by frame, and the streaming encoder encodes the input audio frames to obtain the streaming encoding features of the audio frames.
It should be noted that, the content of S607 may be referred to S602, and will not be described again.
S608, inputting the streaming coding features of the plurality of audio frames in the audio stream into a non-streaming coder, and coding the streaming coding features of the plurality of audio frames through the non-streaming coder to obtain the non-streaming coding features of the audio stream.
Wherein the non-streaming coding feature is used to characterize the audio features of the audio stream. It should be understood that non-streaming refers to speech recognition after all inputs are completed. For example, in some embodiments, after obtaining the streaming coding features of all audio frames in the audio stream, the streaming coding features of all audio frames in the audio stream may be input to the non-streaming encoder, which encodes the streaming coding features of all audio frames to obtain the non-streaming coding features of the audio stream.
It should be noted that, in the process of performing speech recognition on an audio stream based on a stream encoder and a non-stream encoder, if a downsampling module is used in the stream encoding stage, an upsampling process is also required in the non-stream encoding stage to obtain the features of the original feature sequence length, so as to avoid performance loss. Fig. 9 is a schematic flow chart of a downsampling process and an upsampling process according to an embodiment of the present application. Referring to fig. 9, taking an example of a two-time downsampling process in the stream encoding stage, the original feature sequence of the audio stream may be the feature sequence 901 shown in fig. 9, which has a feature sequence length of 12. The feature sequence obtained by the first downsampling process may be the feature sequence 902 shown in fig. 9, where the feature sequence length is 6. The feature sequence obtained by the second downsampling process may be the feature sequence 903 shown in fig. 9, where the feature sequence length is 3. Furthermore, in the non-streaming encoding phase, the audio stream needs to be up-sampled twice. The feature sequence obtained by the first upsampling process may be the feature sequence 904 shown in fig. 9, where the feature sequence length is 6. The feature sequence obtained by the second upsampling process may be the feature sequence 905 shown in fig. 9, where the feature sequence length is 12.
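One simple way to realize the up-sampling of fig. 9, assumed here purely for illustration, is to repeat each compressed frame until the original feature sequence length is restored.

```python
import torch

def upsample(features, factor=2):
    """Sketch: restore the feature sequence length after a downsampling stage,
    e.g. 3 -> 6 -> 12 frames as in fig. 9, by repeating each frame."""
    return torch.repeat_interleave(features, factor, dim=1)  # (B, T, C) -> (B, factor*T, C)

x = torch.randn(1, 3, 8)   # compressed feature sequence of length 3
x = upsample(x)            # length 6 after the first up-sampling process
x = upsample(x)            # length 12 after the second up-sampling process
print(x.shape)             # torch.Size([1, 12, 8])
```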
Further, in some embodiments, in the process of performing speech recognition on an audio stream based on a streaming encoder and a non-streaming encoder, the output features of at least one network layer in the non-streaming encoder may also be subjected to a downsampling process by the downsampling module.
For example, in some embodiments, the output characteristics of each network layer in the non-streaming encoder may be downsampled by the downsampling module. As another example, in other embodiments, the downsampling module may downsample the output characteristics of a portion of the network layers in the non-streaming encoder, such as by performing the downsampling process once per a preset number of network layers.
It should be noted that, in the case of using the downsampling module in the non-streaming encoding stage, the upsampling process is also required before the non-streaming encoding stage outputs the non-streaming encoding feature, so as to obtain the feature of the original feature sequence length, so as to avoid performance loss.
Therefore, the length of the output characteristic in the non-streaming coder can be effectively shortened, the operation amount of the non-streaming coder is reduced, and the operation amount of a voice recognition model is further reduced, so that the efficiency of voice recognition can be further improved.
S609, determining a second output probability distribution of the audio stream based on the non-stream coding feature of the audio stream and text prediction results of a plurality of audio frames in the audio stream.
The text prediction result is predicted based on the speech recognition result of the audio frame preceding each audio frame. For the process of obtaining the text prediction result, refer to S603, which is not repeated here. Therefore, determining the second output probability distribution of the audio stream by combining the non-streaming encoding feature of the audio stream with the text prediction results of the audio frames in the audio stream can improve the accuracy of speech recognition.
The second output probability distribution represents probability distribution of a plurality of audio frames in the audio stream on a plurality of candidate texts corresponding to the text prediction result. For example, the second output probability distribution may include probability values for a plurality of candidate texts corresponding respectively to all audio frames in the audio stream. It should be appreciated that if the probability value of an audio frame over a candidate text is large, it is indicated that the audio frame is likely to correspond to the candidate text.
In some embodiments, after obtaining the non-stream coding feature of the audio stream and the text prediction results of the plurality of audio frames in the audio stream, inputting the non-stream coding feature of the audio stream and the text prediction results of the plurality of audio frames in the audio stream into the fusion network, and performing fusion processing on the non-stream coding feature of the audio stream and the text prediction results of the plurality of audio frames in the audio stream through the fusion network to obtain a second output probability distribution of the audio stream.
Therefore, by utilizing the fusion network of the voice recognition model, the non-stream coding characteristic of the audio stream and the text prediction results of a plurality of audio frames in the audio stream are fused, so that the non-stream coding characteristic of the audio stream and the text prediction results of a plurality of audio frames in the audio stream are comprehensively referred, the voice recognition efficiency can be ensured, and the voice recognition accuracy can be improved.
S610, decoding the second output probability distribution of the audio stream to obtain a voice recognition result of the audio stream.
The speech recognition result of the audio stream may be a text recognition result, such as a piece of text, corresponding to the audio stream.
In the above embodiment, when the method is applied to the electronic device with high power deployment capability, since the power deployment capability of the electronic device is high, the method can combine the streaming encoder with the non-streaming encoder to perform voice recognition, so that the efficiency of voice recognition can be ensured, and the accuracy of voice recognition can be improved.
S611, displaying a voice recognition result of the audio stream on the electronic equipment.
Therefore, when the user inputs the audio stream, the voice recognition result with higher accuracy can be displayed for the user, and the man-machine interaction efficiency is improved.
In the embodiment shown in S606-S611, in the scenario where the electronic device belongs to the second device type, the audio stream is subjected to speech recognition based on the streaming encoder and the non-streaming encoder, so as to obtain a speech recognition result of the audio stream.
In one example, in the process of receiving the audio stream [x_1, x_2, …, x_T] input by the user, the electronic device inputs a frame of audio x_1 into the streaming encoder, and the streaming encoder encodes the frame of audio x_1 to obtain its streaming encoding feature h_1^s, which is cached. This streaming encoding process is executed in a loop until the input of the audio stream ends, and the cached streaming encoding features [h_1^s, h_2^s, …] of the multi-frame audio in the audio stream are obtained. The text prediction results of the multi-frame audio in the audio stream are obtained through the prediction network. Further, all the cached streaming encoding features [h_1^s, h_2^s, …] are input into the non-streaming encoder, which encodes the streaming encoding features [h_1^s, h_2^s, …] to obtain the non-streaming encoding features [h_1^ns, h_2^ns, …] of the audio stream. The non-streaming encoding features [h_1^ns, h_2^ns, …] of the audio stream and the text prediction results of the multi-frame audio in the audio stream are input into the fusion network, and the output probability distribution of each audio frame is calculated frame by frame through the fusion network to obtain the output probability distribution of the audio stream. The output probability distribution of the audio stream is decoded to obtain the speech recognition result of the audio stream, and finally the speech recognition result is displayed on the electronic device.
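The non-streaming flow of S606-S611 can likewise be summarized by a short sketch that reuses the CascadedTransducer sketch above; caching the streaming features in a Python list and the greedy per-frame decoding are illustrative assumptions.

```python
import torch

@torch.no_grad()
def non_streaming_recognize(model, audio_frames, blank_id=0):
    """Second device type: cache the streaming coding features of the whole audio
    stream, re-encode them with the non-streaming encoder, then fuse and decode."""
    cached, enc_state = [], None
    for frame in audio_frames:                       # S607: streaming encoding, cached
        h_s, enc_state = model.streaming_encoder(frame, enc_state)
        cached.append(h_s)
    h_s_all = torch.cat(cached, dim=1)               # [h_1^s, h_2^s, ...]
    h_ns = model.non_streaming_encoder(h_s_all)      # S608: non-streaming coding features
    hyp, prev, pred_state = [], torch.tensor([[blank_id]]), None
    for t in range(h_ns.size(1)):                    # S609/S610: fuse and decode per frame
        pred, pred_state = model.prediction_network(model.embed(prev), pred_state)
        dist = model.joint(h_ns[:, t:t + 1, :], pred)
        token = dist.argmax(dim=-1).item()
        if token != blank_id:
            hyp.append(token)
            prev = torch.tensor([[token]])
    return hyp                                       # S611: display the final result
```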
In the speech recognition model provided by the application, two speech recognition modes are provided, namely a streaming speech recognition mode and a non-streaming speech recognition mode. Fig. 10 is a schematic diagram of a speech recognition mode according to an embodiment of the present application. Referring to fig. 10, two speech recognition modes, a streaming speech recognition mode and a non-streaming speech recognition mode, are shown. In the streaming voice recognition mode, the audio stream is encoded by a streaming encoder, is subjected to feature fusion by a fusion network and is then decoded, so that a voice recognition result is obtained. It should be understood that the voice recognition flows shown in the above-described S601 to S605 correspond to the streaming voice recognition modes. In the non-stream type voice recognition mode, the audio stream is respectively encoded by a stream type encoder, encoded by the non-stream type encoder, subjected to feature fusion by a fusion network and then decoded, so that a voice recognition result is obtained. It should be understood that the voice recognition flows shown in the above-mentioned S606-S611 correspond to the non-streaming voice recognition modes.
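Putting the two modes of fig. 10 together, mode selection reduces to a simple dispatch on the device's computing power deployment capability; the device-type labels below are assumptions for this sketch, which reuses the two recognition functions above.

```python
def recognize(model, audio_frames, device_type):
    """Sketch: select the speech recognition mode by device type, reusing the
    streaming_recognize and non_streaming_recognize sketches above."""
    if device_type == "first_device_type":      # low computing power deployment capability
        return streaming_recognize(model, audio_frames)
    return non_streaming_recognize(model, audio_frames)  # high capability: both encoders
```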
By way of example, Table 1 is a comparison of speech recognition effects according to an embodiment of the present application. Referring to Table 1, the original model represents a speech recognition model employed in the related art, such as a large language model (Large Language Model, LLM) or the speech recognition model Wav2Vec2. The new model represents the speech recognition model employed in the embodiments of the present application. The parameter amount represents the number of model parameters of the speech recognition model. It should be understood that a larger number of model parameters not only increases the storage space occupied on the electronic device but also increases the amount of computation. For example, taking a speech recognition model adopted in the related art as an example, for an encoder based on an attention mechanism, the time complexity of the attention mechanism increases the amount of computation of the encoder, which increases power consumption and latency and affects the final user experience.
The calculation speed represents the speed of the speech recognition model when performing speech recognition; for example, it can be expressed in floating point operations per second (floating point operations per second, FLOPS). The calculated real-time rate represents the real-time rate of the speech recognition model when performing speech recognition; for example, it can be expressed by the real-time factor (real-time factor, RTF). The computing power deployment parameter represents the computing power deployment capability of the electronic device; for example, it may be represented by a core index set for the electronic device.
TABLE 1
As can be seen from Table 1, for the electronic device with high computing power deployment capability (A78-2.4G), the speech recognition model employed in the related art has the same parameter amount as the speech recognition model employed in the embodiment of the present application, but the model of the embodiment of the present application is superior in both calculation speed and calculated real-time rate. For the electronic device with low computing power deployment capability (A78-1.8G), the speech recognition model employed in the embodiment of the present application is superior to the related-art model in parameter amount, calculation speed, and calculated real-time rate. Compared with the related art, the speech recognition method provided by the embodiment of the present application can accommodate both electronic devices with low computing power deployment capability and electronic devices with high computing power deployment capability, and can effectively reduce the computational resources consumed by speech recognition and reduce its latency, thereby improving the efficiency of speech recognition and achieving high accuracy, low power consumption, and low latency.
The technical scheme provided by the embodiment of the application provides a speech recognition model having both a streaming encoder and a non-streaming encoder. For an electronic device of the first device type, speech recognition may be performed on the audio stream received by the electronic device based on the streaming encoder. For an electronic device of the second device type, speech recognition may be performed on the audio stream received by the electronic device by combining the streaming encoder and the non-streaming encoder. The computing power deployment capability of the first device type is lower than that of the second device type. That is, for an electronic device with low computing power deployment capability, speech recognition may be performed on the received audio stream based on the streaming encoder alone; for an electronic device with high computing power deployment capability, speech recognition may be performed on the received audio stream by combining the streaming encoder and the non-streaming encoder. In this way, electronic devices with different computing power deployment capabilities can all be accommodated. Because the streaming encoder performs speech recognition frame by frame, it consumes fewer computational resources. Therefore, when the method is applied to an electronic device with low computing power deployment capability, performing speech recognition based on the streaming encoder can effectively reduce the computational resources consumed by speech recognition and reduce its latency, thereby improving the efficiency of speech recognition. When the method is applied to an electronic device with high computing power deployment capability, the higher computing power allows the streaming encoder to be combined with the non-streaming encoder for speech recognition, which not only maintains the efficiency of speech recognition but also improves its accuracy.
For the speech recognition model adopted in fig. 6, before implementing the scheme, the initial model needs to be trained iteratively based on audio training data to obtain the speech recognition model, where the audio training data includes an audio sample and a labeling text of the audio sample. Fig. 11 is a flowchart of a training method of a speech recognition model according to an embodiment of the present application. Referring to fig. 11, the method includes the following S1101-S1102:
s1101, performing iterative training on the initial model based on the audio training data to obtain the voice recognition model.
The speech recognition model comprises a streaming encoder and a non-streaming encoder. The streaming encoder is configured to provide speech recognition services for electronic devices of a first device type, the streaming encoder in combination with the non-streaming encoder being configured to provide speech recognition services for electronic devices of a second device type, the computational power deployment capability of the first device type being lower than the computational power deployment capability of the second device type.
S1102, in the process of any iterative training, inputting the audio training data into the model obtained after the previous iterative training; obtaining a first recognition result of the audio sample through the streaming encoder of the model, and determining a first model loss value of the streaming encoder based on the first recognition result and the labeling text; obtaining a second recognition result of the audio sample through the streaming encoder and the non-streaming encoder of the model, and determining a second model loss value of the non-streaming encoder based on the second recognition result and the labeling text; and adjusting model parameters based on the first model loss value and the second model loss value.
According to the technical scheme provided by the embodiment of the application, a speech recognition model with better speech recognition capability is obtained by iteratively training the initial model. The first model loss value of the streaming encoder is determined from the labeling text and the first recognition result obtained by speech recognition with the streaming encoder of the model. The second model loss value of the non-streaming encoder is determined from the labeling text and the second recognition result obtained by speech recognition with the streaming encoder and the non-streaming encoder of the model. Because the first model loss value represents the difference between the speech recognition result output based on the streaming encoder and the labeling text, and the second model loss value represents the difference between the speech recognition result output based on the non-streaming encoder and the labeling text, adjusting the model parameters according to the first model loss value and the second model loss value can improve the learning capability of the model, so that a speech recognition model with better learning capability is obtained through training. Furthermore, the trained speech recognition model is applicable not only to electronic devices of the first device type, i.e., electronic devices with low computing power deployment capability, but also to electronic devices of the second device type, i.e., electronic devices with high computing power deployment capability, which improves the flexibility of speech recognition.
Fig. 12 is a flowchart of a training method of a speech recognition model according to an embodiment of the present application. Referring to fig. 12, taking as an example the process of any one of the iterative training in the model training, the method includes the following S1201-S1207:
and S1201, the electronic equipment inputs the audio training data into a model obtained after the previous iterative training in the process of any iterative training.
The audio training data refers to training data of an initial model, and comprises an audio sample and a labeling text of the audio sample.
Fig. 13 is a schematic diagram of a model framework of a model training stage according to an embodiment of the present application. Referring to fig. 13, in the model shown in fig. 13, a concatenated encoder including a streaming encoder and a non-streaming encoder, a prediction network, a fusion network, and a linear layer are employed.
In the model training phase, the streaming encoder is used to process the feature sequence of the audio frames in the audio samples and extract valid features from it. The output of the streaming encoder may be used as the input of the non-streaming encoder, which processes the output of the streaming encoder to further extract valid features. The prediction network is used to predict future text based on the historical speech recognition results (i.e., the text already recognized) of the audio samples. Both the output of the streaming encoder and the output of the non-streaming encoder are fed into the fusion network, which has a feature fusion function and fuses the output of the cascaded encoder with the output of the prediction network to obtain the final recognition result. In addition, in the model training phase, the fusion network is also used to obtain the recurrent neural network transducer (recurrent neural network transducer, RNNT) loss value of the streaming encoder and the RNNT loss value of the non-streaming encoder, and the linear layer is used to obtain the connectionist temporal classification (connectionist temporal classification, CTC) loss value of the streaming encoder and the CTC loss value of the non-streaming encoder.
In the model shown in fig. 13, the streaming encoder shares the prediction network, the fusion network, and the linear layer with the non-streaming encoder, so that the parameter amount of the model can be effectively reduced, and the model with a smaller model parameter amount can be obtained through training.
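The training-stage wiring of fig. 13 can be sketched as follows, again building on the CascadedTransducer sketch above; the shared linear layer producing frame-level CTC logits and the dictionary of four outputs are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TrainingHeads(nn.Module):
    """Sketch of fig. 13: both encoders share the prediction network, the fusion
    network, and one linear layer; the linear layer yields CTC logits and the
    fusion network yields RNNT logits for each encoder."""

    def __init__(self, model, hidden=256, vocab=4096):
        super().__init__()
        self.model = model                          # CascadedTransducer sketch above
        self.ctc_head = nn.Linear(hidden, vocab)    # shared linear layer

    def forward(self, feats, tokens):
        h_s, _ = self.model.streaming_encoder(feats)
        h_ns = self.model.non_streaming_encoder(h_s)
        pred, _ = self.model.prediction_network(self.model.embed(tokens))
        return {
            "ctc_s": self.ctc_head(h_s),             # CTC logits of the streaming encoder
            "ctc_ns": self.ctc_head(h_ns),           # CTC logits of the non-streaming encoder
            "rnnt_s": self.model.joint(h_s, pred),   # RNNT logits of the streaming encoder
            "rnnt_ns": self.model.joint(h_ns, pred), # RNNT logits of the non-streaming encoder
        }
```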
S1202, the electronic device obtains a first recognition result of the audio sample through a streaming encoder of the model.
The first recognition result represents a voice recognition result obtained by voice recognition based on the stream encoder.
In some embodiments, the electronic device encodes the audio samples with the streaming encoder to obtain streaming encoding characteristics of the audio samples. Further, the electronic device determines a first recognition result of the audio sample based on the streaming encoding feature of the audio sample and the text prediction result of the audio sample. Wherein the stream coding feature is used to characterize the audio characteristics of each audio frame in the audio sample.
The text prediction result is predicted based on the speech recognition result of the audio frame preceding each audio frame in the audio sample. In some embodiments, the electronic device may input the speech recognition result of the previous audio frame of each audio frame into the prediction network of the model, and predict the current audio frame based on the speech recognition result of the previous audio frame through the prediction network to obtain the text prediction result of each audio frame. In this way, the prediction network is arranged in the model, the text prediction result of the current audio frame is predicted by using the prediction network, and then the first recognition result of the audio sample is determined by combining the text prediction result of the current audio frame.
S1203, the electronic device determines a first model loss value of the streaming encoder based on the first recognition result and the labeling text.
The labeling text may be a manually labeled text label. The first model loss value represents a difference between the speech recognition result based on the output of the stream encoder and the annotated text.
In some embodiments, the electronic device determines an RNNT loss value of the streaming encoder as the first model loss value based on the first recognition result and the annotation text.
For example, the electronic device may determine the RNNT loss value of the stream encoder through a fusion network of the first recognition result and the annotated text input model.
In this way, by determining the RNNT loss value of the streaming encoder, the first model loss value of the streaming encoder can be quickly determined.
In some embodiments, the electronic device further determines a CTC loss value for the streaming encoder based on the streaming encoding characteristics of the audio samples. Further, the first model loss value is determined based on the CTC loss value of the stream encoder and the RNNT loss value of the stream encoder.
For example, the electronic device may input the stream-coding features of the audio samples into a linear layer of the model, through which CTC loss values of the stream encoder are determined. Further, after determining the CTC loss value of the streaming encoder, the CTC loss value of the streaming encoder and the RNNT loss value of the streaming encoder may be weighted and summed to obtain the first model loss value.
In the above embodiment, the first model loss value of the streaming encoder is comprehensively determined by combining the CTC loss value of the streaming encoder on the basis of the RNNT loss value of the streaming encoder, so that the information amount referred to in determining the model loss value is increased, and the accuracy of determining the model loss value is improved.
And S1204, the electronic equipment acquires a second identification result of the audio sample through the streaming coder and the non-streaming coder of the model.
The second recognition result represents a voice recognition result obtained by performing voice recognition based on the streaming encoder and the non-streaming encoder.
In some embodiments, the audio samples are encoded by the streaming encoder to obtain streaming encoding characteristics of the audio samples. And encoding the streaming coding characteristic of the audio sample by the non-streaming encoder to obtain the non-streaming coding characteristic of the audio sample. Further, a second recognition result of the audio sample is determined based on the non-streaming encoding characteristic of the audio sample and the text prediction result of the audio sample. Wherein the non-streaming coding feature is used to characterize the audio characteristics of the audio sample.
And S1205, the electronic device determines a second model loss value of the non-streaming encoder based on the second recognition result and the labeling text.
Wherein the second model loss value represents a difference between the speech recognition result based on the non-streaming encoder output and the annotated text.
In some embodiments, the electronic device determines an RNNT loss value for the non-streaming encoder as the second model loss value based on the second recognition result and the annotation text.
For example, the electronic device may determine the RNNT loss value of the non-streaming encoder via a fusion network of the second recognition result and the annotated text input model.
Thus, by determining the RNNT loss value of the non-streaming encoder, the second model loss value of the non-streaming encoder can be quickly determined.
In some embodiments, the electronic device further determines a CTC loss value for the non-streaming encoder based on the non-streaming encoding characteristics of the audio samples. Further, the second model loss value is determined based on the CTC loss value of the non-streaming encoder and the RNNT loss value of the non-streaming encoder.
For example, the electronic device may input the non-streaming encoding features of the audio samples into a linear layer of the model, through which CTC loss values for the non-streaming encoder are determined. Further, after determining the CTC loss value of the non-streaming encoder, a weighted summation process may be performed on the CTC loss value of the non-streaming encoder and the RNNT loss value of the non-streaming encoder to obtain the second model loss value.
In the above embodiment, the CTC loss value of the non-streaming encoder is combined on the basis of the RNNT loss value of the non-streaming encoder to comprehensively determine the second model loss value of the non-streaming encoder, so that the information amount referred to in determining the model loss value is increased, and the accuracy of determining the model loss value is improved.
It is noted that in another possible implementation, the electronic device can also obtain other types of model loss values to perform the above-mentioned procedure of adjusting the model parameters according to the model loss values. The embodiment of the present application is not limited thereto.
S1206, the electronic device adjusts the model parameters based on the first model loss value and the second model loss value.
In some embodiments, the electronic device performs a weighted summation process on the first model loss value and the second model loss value to obtain a model-trained loss value. Further, the model parameters are adjusted based on the loss values of the model training.
In one example, during any one iterative training, the audio sample Y_T = [y_1, y_2, …, y_T] may be input into the streaming encoder to obtain the streaming encoding features, which may be denoted H^s = [h_1^s, h_2^s, …, h_T^s]. H^s is input into the linear layer, and the CTC loss value of the streaming encoder calculated by the linear layer may be denoted L_CTC^s. H^s is then input into the fusion network, which fuses it with the output result of the prediction network and calculates the RNNT loss value of the streaming encoder, which may be denoted L_RNNT^s.
Further, H^s is input into the non-streaming encoder to obtain the non-streaming encoding features H^ns = [h_1^ns, h_2^ns, …, h_T^ns]. H^ns is input into the linear layer, and the CTC loss value of the non-streaming encoder calculated by the linear layer may be denoted L_CTC^ns. H^ns is then input into the fusion network, which fuses it with the output result of the prediction network and calculates the RNNT loss value of the non-streaming encoder, which may be denoted L_RNNT^ns.
After obtaining the CTC loss value of the streaming encoder, the RNNT loss value of the streaming encoder, the CTC loss value of the non-streaming encoder, and the RNNT loss value of the non-streaming encoder, determining a model training loss value based on the weighted sum formula (1), and adjusting the model parameter based on the model training loss value, thereby completing the iterative training.
L = α_1 * L_CTC^s + α_2 * L_RNNT^s + α_3 * L_CTC^ns + α_4 * L_RNNT^ns  (1)
Wherein L represents the loss value of model training; L_CTC^s represents the CTC loss value of the streaming encoder; α_1 represents the weight coefficient corresponding to the CTC loss value of the streaming encoder, for example 0.2; L_RNNT^s represents the RNNT loss value of the streaming encoder; α_2 represents the weight coefficient corresponding to the RNNT loss value of the streaming encoder, for example 1.0; L_CTC^ns represents the CTC loss value of the non-streaming encoder; α_3 represents the weight coefficient corresponding to the CTC loss value of the non-streaming encoder, for example 0.2; L_RNNT^ns represents the RNNT loss value of the non-streaming encoder; α_4 represents the weight coefficient corresponding to the RNNT loss value of the non-streaming encoder, for example 1.0.
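Formula (1) translates directly into a weighted sum. The sketch below assumes the four loss values have already been computed with a CTC loss and an RNNT loss implementation, and uses the example weights given above.

```python
def combined_loss(l_ctc_s, l_rnnt_s, l_ctc_ns, l_rnnt_ns,
                  a1=0.2, a2=1.0, a3=0.2, a4=1.0):
    """Formula (1): weighted sum of the CTC and RNNT loss values of the
    streaming and non-streaming encoders (example weights from the text)."""
    return a1 * l_ctc_s + a2 * l_rnnt_s + a3 * l_ctc_ns + a4 * l_rnnt_ns
```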
In this way, the first model loss value can represent the difference between the speech recognition result output by the streaming encoder and the labeling text, and the second model loss value can represent the difference between the speech recognition result output by the non-streaming encoder and the labeling text, so that the model parameters are adjusted according to the first model loss value and the second model loss value, the learning capacity of the model can be improved, and the speech recognition model with better learning capacity can be trained.
After adjusting the model parameters, the electronic device further determines whether the model training satisfies the target condition, and further executes S1207 if the model training does not satisfy the target condition. And under the condition that the model training meets the target condition, acquiring the model obtained by training in the iterative process as a voice recognition model.
S1207, under the condition that the model with the model parameters adjusted does not meet the target conditions, the electronic equipment performs next iteration training based on the model with the model parameters adjusted until the model meets the target conditions.
In some embodiments, the target condition includes at least one of the following: the number of iterations of model training reaches a target number; or, the model loss value is less than or equal to a target threshold. The target number is a preset number of training iterations, for example 100. The embodiment of the application does not limit how the target number is set. The target threshold is a preset fixed threshold, for example a model loss value below 0.0001. The embodiment of the application does not limit how the target threshold is set.
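For illustration only, the target condition may be checked as follows; the iteration count of 100 and the threshold of 0.0001 are the example values from the text, and the helper name is hypothetical:

```python
def should_stop(iteration, loss, max_iterations=100, loss_threshold=1e-4):
    """Target-condition sketch: stop when either the iteration count reaches the
    target number or the training loss falls to or below the target threshold."""
    return iteration >= max_iterations or loss <= loss_threshold
```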
According to the technical scheme provided by the embodiment of the application, a speech recognition model with better speech recognition capability is obtained by iteratively training the initial model. The first model loss value of the streaming encoder is determined from the labeling text and the first recognition result obtained by performing speech recognition with the streaming encoder of the model. The second model loss value of the non-streaming encoder is determined from the labeling text and the second recognition result obtained by performing speech recognition with the streaming encoder and the non-streaming encoder of the model. Because the first model loss value can represent the difference between the speech recognition result output by the streaming encoder and the labeling text, and the second model loss value can represent the difference between the speech recognition result output by the non-streaming encoder and the labeling text, adjusting the model parameters according to the first model loss value and the second model loss value improves the learning capability of the model, so that a speech recognition model with better learning capability can be trained. Furthermore, the trained speech recognition model is applicable not only to electronic devices of the first device type, that is, electronic devices with low computing power deployment capability, but also to electronic devices of the second device type, that is, electronic devices with high computing power deployment capability, thereby improving the flexibility of speech recognition.
Note that the electronic device that executes the model training process in fig. 12 may be the same as, or different from, the electronic device that executes the model application process in fig. 6. For example, in some embodiments, the electronic device that executes the model application process in fig. 6 may be a terminal device, and the electronic device that executes the model training process in fig. 12 may be a server.
It will be appreciated that, in order to implement the above-described functions, the electronic device (such as a terminal) in the embodiment of the present application includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
Fig. 14 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application. Referring to fig. 14, the speech recognition apparatus includes a receiving module 1401 and a speech recognition module 1402. Wherein,
the electronic device belongs to a first device type; a receiving module 1401, configured to receive an audio stream input by a user; a speech recognition module 1402, configured to perform speech recognition on the audio stream based on the stream encoder, to obtain a speech recognition result of the audio stream;
or, the electronic device belongs to a second device type; a receiving module 1401, configured to receive an audio stream input by a user; a speech recognition module 1402, configured to perform speech recognition on the audio stream based on the stream encoder and the non-stream encoder, to obtain a speech recognition result of the audio stream;
wherein the computing power deployment capability of the first device type is lower than the computing power deployment capability of the second device type.
The technical scheme provided by the embodiment of the application provides a speech recognition model with a streaming encoder and a non-streaming encoder. For an electronic device of the first device type, speech recognition may be performed on the audio stream received by the electronic device based on the streaming encoder. For an electronic device of the second device type, speech recognition may be performed on the audio stream received by the electronic device by combining the streaming encoder and the non-streaming encoder. The computing power deployment capability of the first device type is lower than that of the second device type. That is, for an electronic device with low computing power deployment capability, speech recognition may be performed on the received audio stream based on the streaming encoder alone; for an electronic device with high computing power deployment capability, speech recognition may be performed on the received audio stream by combining the streaming encoder and the non-streaming encoder. Because the streaming encoder performs speech recognition in a streaming manner at the granularity of audio frames, it consumes fewer computing resources. Therefore, when applied to an electronic device with low computing power deployment capability, performing speech recognition based on the streaming encoder can effectively reduce the computing resources consumed by speech recognition, reduce the time delay of speech recognition, and improve the efficiency of speech recognition. When applied to an electronic device with high computing power deployment capability, the streaming encoder can be combined with the non-streaming encoder for speech recognition because the device's computing power deployment capability is high, which not only ensures the efficiency of speech recognition but also improves the accuracy of speech recognition.
In some embodiments, the speech recognition module 1402 includes:
the input module is used for inputting the audio stream into the stream encoder according to the audio frame, and encoding the input audio frame by the stream encoder to obtain the stream coding characteristic of the audio frame, wherein the stream coding characteristic is used for representing the audio characteristic of the audio frame;
and the determining module is used for determining a first output probability distribution of the audio frame based on the stream coding characteristic of the audio frame and a text prediction result of the audio frame, wherein the text prediction result is predicted based on a voice recognition result of a previous audio frame of the audio frame, and the first output probability distribution represents probability distribution of the audio frame on a plurality of candidate texts corresponding to the text prediction result.
In some embodiments, the speech recognition model further comprises a fusion network;
the determining module is used for:
inputting the stream coding feature of the audio frame and the text prediction result of the audio frame into the fusion network, and carrying out fusion processing on the stream coding feature of the audio frame and the text prediction result of the audio frame through the fusion network to obtain a first output probability distribution of the audio frame.
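As an illustration of how such a fusion network may combine the two inputs, the following sketch follows a common additive joint-network form; the projection sizes, the tanh nonlinearity, and the class name are assumptions rather than the patent's concrete design:

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Sketch of a fusion network: combines an encoder feature with the prediction
    network's text prediction to produce a distribution over candidate texts."""
    def __init__(self, enc_dim=256, pred_dim=256, joint_dim=256, vocab_size=5000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, enc_feat, pred_feat):
        # enc_feat: streaming coding feature of the current audio frame, (batch, enc_dim)
        # pred_feat: prediction-network output for the current step, (batch, pred_dim)
        joint = torch.tanh(self.enc_proj(enc_feat) + self.pred_proj(pred_feat))
        # First output probability distribution over the candidate texts.
        return torch.log_softmax(self.out(joint), dim=-1)
```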
In some embodiments, the apparatus further comprises:
The decoding module is used for decoding the first output probability distribution of the audio frame to obtain a voice recognition result of the audio frame;
and the display module is used for displaying the voice recognition result of the audio frame on the electronic equipment.
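A decoding step of this kind can be sketched, for example, as a greedy search over the per-frame distributions; the blank index and the collapse-repeats rule are assumptions of this sketch, not a mandated decoding scheme:

```python
def greedy_decode(frame_distributions, blank_id=0, id_to_token=None):
    """Greedy decoding sketch: pick the most probable candidate for each frame,
    collapse consecutive repeats, and drop the blank symbol (assumed index 0)."""
    result, prev = [], blank_id
    for probs in frame_distributions:  # probs: list of probabilities over candidate texts
        idx = max(range(len(probs)), key=probs.__getitem__)
        if idx != blank_id and idx != prev:
            result.append(id_to_token[idx] if id_to_token else idx)
        prev = idx
    return result  # speech recognition result to be displayed on the electronic device
```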
In some embodiments, the speech recognition module 1402 includes:
the input module is used for inputting the audio stream into the stream encoder according to the audio frame, and encoding the input audio frame by the stream encoder to obtain the stream coding characteristic of the audio frame, wherein the stream coding characteristic is used for representing the audio characteristic of the audio frame;
the input module is further configured to input the streaming coding features of the plurality of audio frames in the audio stream to the non-streaming encoder, and encode the streaming coding features of the plurality of audio frames by the non-streaming encoder to obtain the non-streaming coding features of the audio stream, where the non-streaming coding features are used for characterizing the audio features of the audio stream;
and the determining module is used for determining a second output probability distribution of the audio stream based on the non-stream coding characteristic of the audio stream and text prediction results of a plurality of audio frames in the audio stream, wherein the text prediction results are predicted based on a voice recognition result of a previous audio frame of the audio frame, and the second output probability distribution represents probability distribution of a plurality of audio frames in the audio stream on a plurality of candidate texts corresponding to the text prediction results respectively.
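The second-pass flow described above may be sketched as follows, assuming hypothetical callables for the per-frame streaming encoder step and the whole-utterance non-streaming encoder:

```python
import torch

def rescore_with_non_streaming(streaming_encoder_step, non_streaming_encoder, frames):
    """Sketch of the second-pass path: streaming coding features are produced frame by
    frame, cached, and then re-encoded as a whole by the non-streaming encoder."""
    cached = []
    for frame in frames:                     # frame: (batch, 1, feat_dim)
        h_s = streaming_encoder_step(frame)  # streaming coding feature of this audio frame
        cached.append(h_s)
    h_s_all = torch.cat(cached, dim=1)       # streaming features of the whole audio stream
    h_ns = non_streaming_encoder(h_s_all)    # non-streaming coding feature of the audio stream
    return h_ns
```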
In some embodiments, the speech recognition model further comprises a fusion network;
the determining module is used for:
inputting the non-stream coding characteristics of the audio stream and the text prediction results of a plurality of audio frames in the audio stream into the fusion network, and carrying out fusion processing on the non-stream coding characteristics of the audio stream and the text prediction results of a plurality of audio frames in the audio stream through the fusion network to obtain a second output probability distribution of the audio stream.
In some embodiments, the apparatus further comprises:
the decoding module is used for decoding the second output probability distribution of the audio stream to obtain a voice recognition result of the audio stream;
and the display module is used for displaying the voice recognition result of the audio stream on the electronic equipment.
In some embodiments, the speech recognition model further comprises a downsampling module for downsampling the output characteristics of at least one network layer of the streaming encoder or the non-streaming encoder.
In some embodiments, the downsampling module includes a first convolution module and a second convolution module; the first convolution module is used for compressing the feature sequence length of the output feature; the second convolution module is used for fusing information among different channels.
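One possible realization of such a downsampling module, sketched with an assumed stride of 2 and kernel size of 3 (the first convolution compresses the feature sequence length, the second 1x1 convolution fuses information among channels):

```python
import torch
import torch.nn as nn

class DownsamplingModule(nn.Module):
    """Sketch of a downsampling module: a strided convolution compresses the
    feature-sequence length, then a point-wise convolution mixes the channels."""
    def __init__(self, channels=256, stride=2, kernel_size=3):
        super().__init__()
        # First convolution module: compresses the sequence length by `stride`.
        self.compress = nn.Conv1d(channels, channels, kernel_size,
                                  stride=stride, padding=kernel_size // 2,
                                  groups=channels)
        # Second convolution module: 1x1 convolution fusing information between channels.
        self.fuse = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, T, channels) output feature of a network layer
        x = x.transpose(1, 2)        # (batch, channels, T) for Conv1d
        x = self.fuse(self.compress(x))
        return x.transpose(1, 2)     # (batch, roughly T/stride, channels)
```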
Fig. 15 is a schematic structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present application. Referring to fig. 15, the training apparatus for the speech recognition model includes a training module 1501. Wherein,
a training module 1501 for performing iterative training on the initial model based on the audio training data to obtain the speech recognition model; the audio training data comprises an audio sample and a labeling text of the audio sample; the speech recognition model includes a streaming encoder and a non-streaming encoder; the streaming encoder is used for providing voice recognition services for the electronic equipment of the first equipment type, and the streaming encoder is combined with the non-streaming encoder and is used for providing voice recognition services for the electronic equipment of the second equipment type, wherein the computational power deployment capability of the first equipment type is lower than that of the second equipment type;
in any iterative training process, inputting the audio training data into a model obtained after the previous iterative training, acquiring a first identification result of the audio sample through a streaming encoder of the model, determining a first model loss value of the streaming encoder based on the first identification result and the labeling text, acquiring a second identification result of the audio sample through a streaming encoder and a non-streaming encoder of the model, and determining a second model loss value of the non-streaming encoder based on the second identification result and the labeling text; model parameters are adjusted based on the first model loss value and the second model loss value.
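One training iteration along these lines may be sketched as follows; the loss-helper method names are hypothetical wrappers around the CTC and RNNT computations described above, and the combination step stands in for the weighted sum of formula (1):

```python
def train_one_iteration(model, optimizer, audio_samples, label_text):
    """One iteration sketch: compute the streaming branch's loss (first model loss)
    and the non-streaming branch's loss (second model loss), then adjust the model
    parameters based on their combination."""
    first_loss = model.first_model_loss(audio_samples, label_text)    # streaming encoder branch
    second_loss = model.second_model_loss(audio_samples, label_text)  # streaming + non-streaming branch
    loss = first_loss + second_loss   # e.g. the weighted sum of formula (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```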
According to the technical scheme provided by the embodiment of the application, a speech recognition model with better speech recognition capability is obtained by iteratively training the initial model. The first model loss value of the streaming encoder is determined from the labeling text and the first recognition result obtained by performing speech recognition with the streaming encoder of the model. The second model loss value of the non-streaming encoder is determined from the labeling text and the second recognition result obtained by performing speech recognition with the streaming encoder and the non-streaming encoder of the model. Because the first model loss value can represent the difference between the speech recognition result output by the streaming encoder and the labeling text, and the second model loss value can represent the difference between the speech recognition result output by the non-streaming encoder and the labeling text, adjusting the model parameters according to the first model loss value and the second model loss value improves the learning capability of the model, so that a speech recognition model with better learning capability can be trained. Furthermore, the trained speech recognition model is applicable not only to electronic devices of the first device type, that is, electronic devices with low computing power deployment capability, but also to electronic devices of the second device type, that is, electronic devices with high computing power deployment capability, thereby improving the flexibility of speech recognition.
In some embodiments, the training module 1501 is configured to:
encoding the audio sample by the stream encoder to obtain stream encoding characteristics of the audio sample, wherein the stream encoding characteristics are used for representing audio characteristics of each audio frame in the audio sample;
and determining a first recognition result of the audio sample based on the stream coding feature of the audio sample and a text prediction result of the audio sample, wherein the text prediction result is predicted based on a voice recognition result of a previous audio frame of each audio frame in the audio sample.
In some embodiments, the training module 1501 is configured to:
and determining an RNNT loss value of the stream encoder as the first model loss value based on the first recognition result and the labeling text.
In some embodiments, the training module 1501 is further configured to:
determining CTC loss values for the streaming encoder based on the streaming encoding characteristics of the audio samples;
the first model loss value is determined based on the CTC loss value of the streaming encoder and the RNNT loss value of the streaming encoder.
In some embodiments, the training module 1501 is configured to:
encoding the audio sample by the stream encoder to obtain stream encoding characteristics of the audio sample, wherein the stream encoding characteristics are used for representing audio characteristics of each audio frame in the audio sample;
Encoding the streaming coding feature of the audio sample by the non-streaming encoder to obtain the non-streaming coding feature of the audio sample, wherein the non-streaming coding feature is used for representing the audio feature of the audio sample;
and determining a second recognition result of the audio sample based on the non-streaming coding feature of the audio sample and a text prediction result of the audio sample, wherein the text prediction result is predicted based on a voice recognition result of a previous audio frame of each audio frame in the audio sample.
In some embodiments, the training module 1501 is configured to:
and determining an RNNT loss value of the non-streaming encoder as the second model loss value based on the second recognition result and the annotation text.
In some embodiments, the training module 1501 is further configured to:
determining CTC loss values for the non-streaming encoder based on non-streaming encoding characteristics of the audio samples;
the second model loss value is determined based on the CTC loss value of the non-streaming encoder and the RNNT loss value of the non-streaming encoder.
In some embodiments, the training module 1501 is configured to:
carrying out weighted summation processing on the first model loss value and the second model loss value to obtain a loss value of model training;
The model parameters are adjusted based on the loss values for the model training.
The embodiment of the application also provides electronic equipment, which comprises: a processor and a memory. A voice recognition model associated with the electronic device is stored in the memory, the voice recognition model comprising a streaming encoder and a non-streaming encoder; the electronic device belongs to a first device type or a second device type, and the computing power deployment capability of the first device type is lower than that of the second device type. The processor is connected with the memory, the memory is used for storing program codes, and the processor executes the program codes stored in the memory, so that the voice recognition method and the training method of the voice recognition model provided by the embodiment of the application are realized.
The embodiment of the application also provides a computer readable storage medium, and the computer readable storage medium stores program code thereon, when the program code runs on the electronic device, the electronic device is caused to execute the functions or steps executed by the electronic device in the above method embodiment.
The embodiment of the application also provides a computer program product, which comprises program code, when the program code runs on the electronic device, causes the electronic device to execute the functions or steps executed by the electronic device in the embodiment of the method.
The electronic device, the computer readable storage medium or the computer program product provided by the embodiments of the present application are configured to execute the corresponding method provided above, and therefore, the beneficial effects achieved by the electronic device, the computer readable storage medium or the computer program product can refer to the beneficial effects in the corresponding method provided above, and are not repeated herein.
It will be apparent to those skilled in the art from this disclosure that, for convenience and brevity, only the above-described division of functional modules is illustrated, and in practical applications, the above-described functional allocation may be performed by different functional modules, that is, the internal structure of the apparatus (e.g., electronic device) is divided into different functional modules, so as to perform all or part of the above-described functions. The specific working processes of the above-described system, apparatus (e.g., electronic device) and unit may refer to the corresponding processes in the foregoing method embodiments, which are not described herein again.
In the several embodiments provided herein, it should be understood that the disclosed systems, apparatuses (e.g., electronic devices) and methods may be implemented in other ways. For example, the above-described embodiments of an apparatus (e.g., an electronic device) are merely illustrative, and the division of the module or unit is merely a logical function division, and may be implemented in other ways, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform all or part of the steps of the method of the various embodiments of the present application. And the aforementioned storage medium includes: flash memory, removable hard disk, read-only memory, random access memory, magnetic or optical disk, and the like.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

1. A voice recognition method, characterized by being applied to an electronic device, wherein the electronic device is associated with a voice recognition model, and the voice recognition model comprises a streaming encoder and a non-streaming encoder;
the electronic equipment belongs to a first equipment type; the method comprises the following steps:
receiving an audio stream input by a user; performing voice recognition on the audio stream based on the stream encoder to obtain a voice recognition result of the audio stream;
or the electronic equipment belongs to a second equipment type; the method comprises the following steps:
receiving an audio stream input by a user; performing voice recognition on the audio stream based on the stream encoder and the non-stream encoder to obtain a voice recognition result of the audio stream;
wherein the computing power deployment capability of the first device type is lower than the computing power deployment capability of the second device type.
2. The method of claim 1, wherein said speech recognition of the audio stream based on the stream encoder comprises:
inputting the audio stream into the stream encoder according to the audio frame, and encoding the input audio frame by the stream encoder to obtain stream coding characteristics of the audio frame, wherein the stream coding characteristics are used for representing audio characteristics of the audio frame;
and determining a first output probability distribution of the audio frame based on the stream coding characteristic of the audio frame and a text prediction result of the audio frame, wherein the text prediction result is predicted based on a voice recognition result of a previous audio frame of the audio frame, and the first output probability distribution represents probability distribution of the audio frame on a plurality of candidate texts corresponding to the text prediction result.
3. The method of claim 2, wherein the speech recognition model further comprises a fusion network;
the determining a first output probability distribution of the audio frame based on the stream coding feature of the audio frame and the text prediction result of the audio frame comprises:
inputting the stream coding characteristics of the audio frames and the text prediction results of the audio frames into the fusion network, and carrying out fusion processing on the stream coding characteristics of the audio frames and the text prediction results of the audio frames through the fusion network to obtain first output probability distribution of the audio frames.
4. A method according to claim 2 or 3, wherein after said determining a first output probability distribution for said audio frame based on said streaming coding characteristics of said audio frame and a text prediction result of said audio frame, said method further comprises:
decoding the first output probability distribution of the audio frame to obtain a voice recognition result of the audio frame;
and displaying the voice recognition result of the audio frame on the electronic equipment.
5. The method of claim 1, wherein the speech recognition of the audio stream based on the stream encoder and the non-stream encoder comprises:
inputting the audio stream into the stream encoder according to the audio frame, and encoding the input audio frame by the stream encoder to obtain stream coding characteristics of the audio frame, wherein the stream coding characteristics are used for representing audio characteristics of the audio frame;
inputting the streaming coding features of a plurality of audio frames in the audio stream into the non-streaming coder, and coding the streaming coding features of the plurality of audio frames through the non-streaming coder to obtain the non-streaming coding features of the audio stream, wherein the non-streaming coding features are used for representing the audio features of the audio stream;
And determining a second output probability distribution of the audio stream based on the non-stream coding characteristic of the audio stream and text prediction results of a plurality of audio frames in the audio stream, wherein the text prediction results are predicted based on a voice recognition result of a previous audio frame of the audio frame, and the second output probability distribution represents probability distribution of a plurality of audio frames in the audio stream on a plurality of candidate texts corresponding to the text prediction results respectively.
6. The method of claim 5, wherein the speech recognition model further comprises a fusion network;
the determining a second output probability distribution of the audio stream based on the non-streaming coding feature of the audio stream and text predictions of a plurality of audio frames in the audio stream comprises:
inputting the non-stream coding characteristics of the audio stream and the text prediction results of a plurality of audio frames in the audio stream into the fusion network, and carrying out fusion processing on the non-stream coding characteristics of the audio stream and the text prediction results of a plurality of audio frames in the audio stream through the fusion network to obtain second output probability distribution of the audio stream.
7. The method of claim 5 or 6, wherein after determining the second output probability distribution for the audio stream based on the non-streaming coding characteristics of the audio stream and text predictions for a plurality of audio frames in the audio stream, the method further comprises:
Decoding the second output probability distribution of the audio stream to obtain a voice recognition result of the audio stream;
and displaying the voice recognition result of the audio stream on the electronic equipment.
8. The method of claim 1, wherein the speech recognition model further comprises a downsampling module for downsampling output characteristics of at least one network layer of the streaming encoder or the non-streaming encoder.
9. The method of claim 8, wherein the downsampling module comprises a first convolution module and a second convolution module; the first convolution module is used for compressing the feature sequence length of the output feature; the second convolution module is used for fusing information among different channels.
10. A method of training a speech recognition model, the method comprising:
iteratively training an initial model based on audio training data to obtain the speech recognition model; the audio training data comprises an audio sample and a labeling text of the audio sample; the speech recognition model comprises a streaming encoder and a non-streaming encoder; the streaming coder is used for providing voice recognition service for the electronic equipment of the first equipment type, and the streaming coder is combined with the non-streaming coder and used for providing voice recognition service for the electronic equipment of the second equipment type, wherein the computing power deployment capacity of the first equipment type is lower than that of the second equipment type;
In any iterative training process, inputting the audio training data into a model obtained after the previous iterative training, acquiring a first recognition result of the audio sample through a streaming encoder of the model, determining a first model loss value of the streaming encoder based on the first recognition result and the labeling text, acquiring a second recognition result of the audio sample through a streaming encoder and a non-streaming encoder of the model, and determining a second model loss value of the non-streaming encoder based on the second recognition result and the labeling text; model parameters are adjusted based on the first model loss value and the second model loss value.
11. The method of claim 10, wherein the obtaining, by the streaming encoder of the model, the first recognition result of the audio sample comprises:
encoding the audio sample by the stream encoder to obtain stream encoding characteristics of the audio sample, wherein the stream encoding characteristics are used for representing audio characteristics of each audio frame in the audio sample;
and determining a first recognition result of the audio sample based on the stream coding characteristic of the audio sample and a text prediction result of the audio sample, wherein the text prediction result is predicted based on a voice recognition result of a previous audio frame of each audio frame in the audio sample.
12. The method of claim 11, wherein the determining a first model loss value for the stream encoder based on the first recognition result and the annotation text comprises:
and determining a recurrent neural network transducer (RNNT) loss value of the stream encoder as the first model loss value based on the first recognition result and the labeling text.
13. The method according to claim 12, wherein the method further comprises:
determining a connectionist temporal classification (CTC) loss value of the streaming encoder based on the streaming coding features of the audio samples;
the first model loss value is determined based on the CTC loss value of the streaming encoder and the RNNT loss value of the streaming encoder.
14. The method of claim 10, wherein the obtaining, by the streaming encoder and the non-streaming encoder of the model, the second recognition result of the audio sample comprises:
encoding the audio sample by the stream encoder to obtain stream encoding characteristics of the audio sample, wherein the stream encoding characteristics are used for representing audio characteristics of each audio frame in the audio sample;
Encoding the streaming coding feature of the audio sample by the non-streaming encoder to obtain the non-streaming coding feature of the audio sample, wherein the non-streaming coding feature is used for representing the audio feature of the audio sample;
and determining a second recognition result of the audio sample based on the non-stream coding characteristic of the audio sample and a text prediction result of the audio sample, wherein the text prediction result is predicted based on a voice recognition result of a previous audio frame of each audio frame in the audio sample.
15. The method of claim 14, wherein the determining a second model loss value for the non-streaming encoder based on the second recognition result and the annotation text comprises:
and determining an RNNT loss value of the non-streaming coder as the second model loss value based on the second recognition result and the labeling text.
16. The method of claim 15, wherein the method further comprises:
determining CTC loss values for the non-streaming encoder based on non-streaming encoding characteristics of the audio samples;
the second model loss value is determined based on the CTC loss value of the non-streaming encoder and the RNNT loss value of the non-streaming encoder.
17. The method of claim 10, wherein the adjusting model parameters based on the first model loss value and the second model loss value comprises:
carrying out weighted summation treatment on the first model loss value and the second model loss value to obtain a model training loss value;
and adjusting the model parameters based on the loss value of the model training.
18. An electronic device comprising a memory and a processor; the memory stores a voice recognition model associated with the electronic device, the voice recognition model comprising a streaming encoder and a non-streaming encoder; the electronic equipment belongs to a first equipment type or a second equipment type, and the computing power deployment capability of the first equipment type is lower than that of the second equipment type; the memory is used for storing program codes; the processor is configured to invoke the program code to perform the method of any of claims 1-9 or 10-17.
19. A computer readable storage medium comprising program code which, when run on an electronic device, causes the electronic device to perform the method of any of claims 1-9 or 10-17.
CN202311312710.3A 2023-10-11 2023-10-11 Speech recognition method, training method, device and medium for speech recognition model Pending CN117079653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311312710.3A CN117079653A (en) 2023-10-11 2023-10-11 Speech recognition method, training method, device and medium for speech recognition model


Publications (1)

Publication Number Publication Date
CN117079653A true CN117079653A (en) 2023-11-17

Family

ID=88706375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311312710.3A Pending CN117079653A (en) 2023-10-11 2023-10-11 Speech recognition method, training method, device and medium for speech recognition model

Country Status (1)

Country Link
CN (1) CN117079653A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360526A (en) * 2022-03-16 2022-04-15 杭州研极微电子有限公司 Audio detection device, method, apparatus and storage medium
WO2022127620A1 (en) * 2020-12-14 2022-06-23 北京有竹居网络技术有限公司 Voice wake-up method and apparatus, electronic device, and storage medium
CN115019776A (en) * 2022-06-09 2022-09-06 内蒙古科技大学 Voice recognition model, training method thereof, voice recognition method and device
CN115798477A (en) * 2022-11-02 2023-03-14 深圳壹秘科技有限公司 Voice recognition method and device, terminal equipment and storage medium
CN116250038A (en) * 2020-10-05 2023-06-09 谷歌有限责任公司 Transducer of converter: unified streaming and non-streaming speech recognition model
CN116386612A (en) * 2023-05-08 2023-07-04 北京百度网讯科技有限公司 Training method of voice detection model, voice detection method, device and equipment
CN116670757A (en) * 2020-10-20 2023-08-29 谷歌有限责任公司 Concatenated encoder for simplified streaming and non-streaming speech recognition
CN116682432A (en) * 2022-09-23 2023-09-01 荣耀终端有限公司 Speech recognition method, electronic device and readable medium
CN116705004A (en) * 2023-04-28 2023-09-05 阿里巴巴(中国)有限公司 Speech recognition method, device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination