CN118173095A - Speech recognition method, apparatus, device, storage medium, and program product - Google Patents

Speech recognition method, apparatus, device, storage medium, and program product

Info

Publication number
CN118173095A
Authority
CN
China
Prior art keywords
audio
frame
fusion
channel
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410165122.XA
Other languages
Chinese (zh)
Inventor
胡今朝
吴重亮
李永超
吴明辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202410165122.XA
Publication of CN118173095A
Legal status: Pending

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present application provides a speech recognition method, apparatus, device, storage medium, and program product. The scheme comprises: obtaining speech to be recognized and a pseudo speech, where the speech to be recognized is multi-channel audio and the pseudo speech is audio whose sequence length equals that of the multi-channel audio; performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech; and decoding the first fused audio sequence corresponding to each channel audio and the second fused audio sequence corresponding to the pseudo speech, respectively, and determining the recognition text corresponding to the speech to be recognized. The technical solution of the present application can effectively improve the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device, storage medium, and program product
Technical Field
The present application relates to the field of speech recognition technology, and in particular, to a speech recognition method, apparatus, device, storage medium, and program product.
Background
Multi-channel audio refers to audio recorded by a plurality of sound receiving devices. For example, a conference scenario includes multiple speakers and multiple microphones placed at different locations; the microphones pick up sound simultaneously to produce multi-channel audio.
In the related art, multi-channel audio is mostly recognized channel by channel as single-channel audio, so the relations among the channels cannot be effectively captured and speech recognition is inaccurate.
Disclosure of Invention
To solve the above problem, the present application provides a speech recognition method, apparatus, device, storage medium, and program product that can improve the accuracy of speech recognition.
According to a first aspect of an embodiment of the present application, there is provided a speech recognition method, including:
obtaining speech to be recognized and a pseudo speech, wherein the speech to be recognized is multi-channel audio and the pseudo speech is audio whose sequence length equals that of the multi-channel audio;
performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
and decoding the first fused audio sequence corresponding to each channel audio and the second fused audio sequence corresponding to the pseudo speech, respectively, and determining the recognition text corresponding to the speech to be recognized.
According to a second aspect of an embodiment of the present application, there is provided a speech recognition apparatus, including:
an obtaining module, configured to obtain speech to be recognized and a pseudo speech, wherein the speech to be recognized is multi-channel audio and the pseudo speech is audio whose sequence length equals that of the multi-channel audio;
a processing module, configured to perform feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
and a recognition module, configured to decode the first fused audio sequence corresponding to each channel audio and the second fused audio sequence corresponding to the pseudo speech, respectively, and determine the recognition text corresponding to the speech to be recognized.
A third aspect of the present application provides an electronic device, comprising:
A memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor implements the above speech recognition method by running the program in the memory.
A fourth aspect of the present application provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described speech recognition method.
A fifth aspect of the application provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described speech recognition method.
One embodiment of the above application has the following advantages or benefits:
Speech to be recognized and a pseudo speech are obtained, where the speech to be recognized is multi-channel audio and the pseudo speech is pre-constructed audio whose sequence length equals that of the multi-channel audio; feature fusion is performed between each audio frame of each channel audio of the multi-channel audio and each audio frame of the pseudo speech, yielding a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech; the first fused audio sequences and the second fused audio sequence are decoded respectively, and the recognition text corresponding to the speech to be recognized is determined. In this way, the audio frames of the pseudo speech interact with the audio frames of all the channels, achieving frame-level feature fusion across the channels while preserving the independence of the different channels; this avoids the feature-fusion distortion caused by a poor signal on some channel and ensures the accuracy of the speech recognition result.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of inter-frame audio feature fusion according to an embodiment of the present application;
Fig. 3 is a schematic diagram of multi-channel speech recognition according to an embodiment of the present application;
Fig. 4 is a schematic diagram of step S120 in a speech recognition method according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application are applicable to various speech recognition scenarios, such as conference scenarios and online education scenarios. By adopting the technical solutions of the embodiments of the present application, the accuracy of the speech recognition result can be improved.
The technical solutions of the embodiments of the present application may be applied, for example, to hardware devices such as processors, electronic devices, and servers (including cloud servers), or may be packaged as a software program to be run. When the hardware device executes the processing procedure of the technical solution, or the software program is run, frame-level feature fusion among multiple channels can be achieved while the independence of the different channels is retained. The embodiments only illustrate the specific processing procedure of the technical solution and do not limit its specific implementation form; any technical implementation capable of executing this processing procedure may be adopted.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of the present application.
Exemplary method
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application. In an exemplary embodiment, a speech recognition method is provided, comprising:
S110, obtaining speech to be recognized and a pseudo speech, where the speech to be recognized is multi-channel audio and the pseudo speech is audio whose sequence length equals that of the multi-channel audio;
S120, performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
S130, decoding the first fused audio sequence corresponding to each channel audio and the second fused audio sequence corresponding to the pseudo speech, respectively, and determining the recognition text corresponding to the speech to be recognized.
In step S110, illustratively, the speech to be recognized is emitted by a plurality of sound sources, where the sound sources include speakers; for example, speaker A, speaker B, speaker C, and speaker D are each sound sources. The speech to be recognized is multi-channel audio: if it is picked up by 8 microphones, it is 8-channel audio. The pseudo speech is audio, randomly drawn from a preset audio library, whose sequence length equals that of the multi-channel audio; the preset audio library may be any open-source audio library or a pre-constructed one, which is not limited here.
Specifically, the multi-channel audio (i.e., the speech to be recognized) is acquired from a plurality of sound receiving devices such as microphones, the audio sequence length of the multi-channel audio is determined, and audio of that length is randomly drawn from the preset audio library as the pseudo speech.
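As a minimal sketch of this acquisition step (assuming frame-level features such as log-mel vectors; the names `build_pseudo_speech` and `library` are illustrative, and repeat-padding short clips is an assumption the patent does not specify):

```python
import numpy as np

def build_pseudo_speech(speech: np.ndarray, library: list) -> np.ndarray:
    """Randomly draw a clip from a preset audio library and fit it to the
    frame length of the speech to be recognized.

    speech:  multi-channel features of shape [n_channels, T, feat_dim]
    library: list of single-channel feature arrays, each [T_i, feat_dim]
    """
    t = speech.shape[1]
    clip = library[np.random.randint(len(library))]
    if len(clip) >= t:
        return clip[:t]                      # crop to the target length
    reps = -(-t // len(clip))                # ceil division
    return np.tile(clip, (reps, 1))[:t]      # repeat-pad shorter clips
```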
In step S120, illustratively, the first fused audio sequence is generated by performing feature fusion on each audio frame of a channel and arranging the resulting fused frames in order; the second fused audio sequence is generated by performing feature fusion on each audio frame of the pseudo speech and likewise arranging the fused frames.
Specifically, feature fusion may be performed between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech in the order of the audio sequences: for example, the first frame of each channel is fused with the first frame of the pseudo speech, and the second and third frames are fused in turn in the same way, generating a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech.
Further, the frame-to-frame feature fusion may be computed from the correlation between audio frames. In this embodiment, a self-attention mechanism (Self-Attention) is used to process each audio frame in the channel audio of the speech to be recognized together with each audio frame of the pseudo speech. Self-attention is a neural-network technique mainly used for sequence data in problems such as machine translation, part-of-speech tagging, and semantic analysis. Its core principle is to let the model capture the relation between all elements of the input sequence, improving both generalization and the ability to handle long sequences. It works as follows: for each input vector a (e.g., each word vector in a language task), self-attention outputs a vector b that is computed by taking all input vectors into account; if there are four input vectors a, four vectors b are output. As shown in fig. 2, the audio sequence of each channel is input to its encoder, a self-attention operation is performed over all same-index frames along the channel dimension, i.e., a weighted interaction fusion based on attention coefficients, and the fusion result is added back to the corresponding frame of each channel. In this way, each channel uses complementary information from the other channels while still being modeled independently, completing the layered interaction operation.
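The following sketch illustrates the frame-level fusion just described for a single frame index, using plain projection-free scaled dot-product self-attention over the channel dimension with a residual add-back; a real Conformer block would apply learned query/key/value projections, which are omitted here for brevity:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_frames_across_channels(frames: np.ndarray) -> np.ndarray:
    """Self-attention over the channel dimension for one frame index.

    frames: [N + 1, d], the same-index frame from each of the N channels
    plus the pseudo-speech channel. The attention output is added back to
    each channel's frame (residual), so every channel keeps its own
    sequence while absorbing complementary information from the others.
    """
    d = frames.shape[-1]
    scores = frames @ frames.T / np.sqrt(d)   # pairwise attention coefficients
    weights = softmax(scores, axis=-1)        # one weight row per channel
    return frames + weights @ frames          # weighted interaction + residual
```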
Step S130 includes: decoding the k-th audio frame in each first fused audio sequence and the k-th audio frame in the second fused audio sequence, respectively, to obtain the decoding posterior probability corresponding to the k-th frame of each first fused audio sequence and the decoding posterior probability corresponding to the k-th frame of the second fused audio sequence;
determining the target recognition text of the k-th audio frame according to a comparison of the decoding posterior probability corresponding to the k-th frame of each first fused audio sequence with that of the second fused audio sequence; where k is a positive integer.
For example, the same decoder may be used to decode the k-th frame of each first fused audio sequence and of the second fused audio sequence, or different decoders may decode them simultaneously. The decoding posterior probability represents the probability of an audio frame over each word of a preset dictionary, where the preset dictionary is a database of words.
Specifically, the 1st audio frame of each first fused audio sequence and the 1st audio frame of the second fused audio sequence are input into the decoders corresponding to the respective channels. Each decoder decodes its 1st frame, determines the probability of that frame over each word of the preset dictionary, and outputs the maximum probability as its decoding posterior probability. The decoding posterior probabilities of the 1st frames of the first fused audio sequences and the second fused audio sequence are then compared, and the text corresponding to the largest one is taken as the text of the 1st frame. The above steps are repeated, decoding the audio frames of the first and second fused audio sequences in turn until the last frame is decoded, and the per-frame texts are ordered to generate the recognition text corresponding to the speech to be recognized. It should be noted that if the decoding posterior probability of a frame in the second fused audio sequence is larger, the pseudo speech has gathered more complementary information; if that of a frame in a first fused audio sequence is larger, the signal quality of some channel is poor, and the noise introduced into the pseudo speech during feature fusion outweighed the complementary information. This posterior-level enhancement therefore resolves the damage to the fused information when a channel fails, while highlighting the complementary advantage of the channel carrying the pseudo speech.
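A toy illustration of this posterior-level comparison (the dictionary, the posterior values, and the helper `pick_frame_text` are all invented for this example):

```python
import numpy as np

def pick_frame_text(posteriors: list, dictionary: list) -> str:
    """posteriors: one vector per decoder (N channels plus the pseudo
    channel), each giving the posterior over the preset dictionary for
    frame k. The word behind the single largest posterior wins."""
    best_prob, best_word = -1.0, None
    for p in posteriors:
        idx = int(np.argmax(p))
        if p[idx] > best_prob:
            best_prob, best_word = p[idx], dictionary[idx]
    return best_word

# Hypothetical usage: three channel decoders plus the pseudo-speech
# decoder each emit a posterior over a four-word dictionary for frame k.
vocab = ["<blank>", "hello", "world", "meeting"]
frame_posteriors = [np.array([0.10, 0.60, 0.20, 0.10]),   # channel 1
                    np.array([0.20, 0.30, 0.40, 0.10]),   # channel 2
                    np.array([0.10, 0.20, 0.30, 0.40]),   # channel 3
                    np.array([0.05, 0.80, 0.10, 0.05])]   # pseudo speech
print(pick_frame_text(frame_posteriors, vocab))           # -> "hello"
```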
In the technical solution of the present application, speech to be recognized and a pseudo speech are obtained, where the speech to be recognized is multi-channel audio and the pseudo speech is pre-constructed audio whose sequence length equals that of the multi-channel audio; feature fusion is performed between each audio frame of each channel audio of the multi-channel audio and each audio frame of the pseudo speech, yielding a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech; the first fused audio sequences and the second fused audio sequence are decoded respectively, and the recognition text corresponding to the speech to be recognized is determined. In this way, the audio frames of the pseudo speech interact with the audio frames of all the channels, achieving frame-level feature fusion across the channels while preserving the independence of the different channels; this avoids the feature-fusion distortion caused by a poor signal on some channel and ensures the accuracy of the speech recognition result.
In one embodiment, after obtaining the speech to be recognized and the pseudo speech, the method further comprises:
interchanging the j-th audio frames of the channels of the multi-channel audio to obtain updated multi-channel audio; where j is a positive integer.
Illustratively, a single audio frame of each channel of the multi-channel audio may be interchanged, or multiple audio frames of each channel may be interchanged. Interchanging the j-th frames of the channels encourages fusion among the frames of the channels and thus enhances the robustness of single-channel modeling.
Optionally, the j-th audio frames of the channels of the multi-channel audio are taken and scrambled, and the N reordered j-th frames are returned to the channels in turn to obtain the updated multi-channel audio; where N is a positive integer and N is not less than j. In this embodiment, the scrambling operation may be a shuffle operation. For example, suppose the 1st frame of channel 1 is A, the 1st frame of channel 2 is B, the 1st frame of channel 3 is C, and the 1st frame of channel 4 is D. After randomly scrambling these frames, the 1st frame of channel 1 is B, the 1st frame of channel 2 is D, the 1st frame of channel 3 is A, and the 1st frame of channel 4 is C. The 2nd and 3rd frames of channels 1-4 are then scrambled in the same way, yielding the scrambled 1st, 2nd, and 3rd frames of channels 1-4 and thus the updated audio sequences of channels 1-4.
Optionally, an interchange rule is preset, i.e., the channels whose audio frames are to be interchanged are selected, and the selected audio frames are then interchanged according to the rule to obtain the updated multi-channel audio. The interchange rule may be adjusted according to actual needs and is not limited here. For example, the preset rule may specify that the 1st frame of channel 1 is exchanged with the 1st frame of channel 3, the 1st frame of channel 2 with the 1st frame of channel 4, the 2nd frame of channel 1 with the 2nd frame of channel 2, the 2nd frame of channel 3 with the 2nd frame of channel 4, and so on. Applying this rule directly to the audio sequences yields the updated audio sequences of channels 1-4.
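A brief sketch of the shuffle variant described above (the feature layout and the random-generator handling are assumptions; the patent does not fix them):

```python
import numpy as np

def shuffle_frame_j(audio: np.ndarray, j: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Scramble (shuffle) the j-th frame across channels.

    audio: [N, T, d] multi-channel features. A random permutation of the
    N channels decides which channel each j-th frame is returned to.
    """
    out = audio.copy()
    perm = rng.permutation(audio.shape[0])
    out[:, j] = audio[perm, j]
    return out

# Shuffling every frame index in turn, as in the channel 1-4 example above:
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 3, 8))            # 4 channels, 3 frames
for j in range(audio.shape[1]):
    audio = shuffle_frame_j(audio, j, rng)
```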
In one embodiment, performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech includes:
in each encoding stage of multi-stage audio encoding, performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, respectively, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
wherein the processing result of a first-stage audio encoding serves as the processing object of a second-stage audio encoding, the first-stage and second-stage audio encodings being two adjacent encodings.
Illustratively, multi-stage audio encoding means that the audio is encoded multiple times. Each channel may be provided with its own encoder, or multiple channels may share the same encoder.
In this embodiment, as shown in fig. 3, the multi-channel audio has 4 channels, denoted channel 1 through channel 4, and the input of each channel is fed into that channel's encoder. Suppose each encoder has M layers, each layer being a Conformer block. On receiving its input, each layer first performs its own forward computation. At the first layer of the encoder, feature fusion is then performed between each audio frame of the channel audio and each audio frame of the pseudo speech, and the first layer outputs a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech. At the second layer, each audio frame of the first fused audio sequences output by the first layer is fused with each audio frame of the corresponding second fused audio sequence; the second layer's output is passed to the third layer, and so on until the last layer finishes its feature fusion, whose output is taken as the output of the whole encoder. The encoder outputs are fed into the corresponding decoders, which decode the audio frames one by one until the last frame, yielding the decoding posterior probability of every audio frame output by each decoder. The text of the largest decoding posterior probability is taken as the text of each frame, and the per-frame texts are ordered to generate the recognition text corresponding to the speech to be recognized. Through such multi-stage audio encoding, the pseudo speech, having interacted with the features many times, captures the complementary information of every channel more comprehensively.
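The layer-by-layer flow can be sketched as follows, reusing `fuse_frames_across_channels` from the earlier sketch; `layers` stands in for the M Conformer blocks, whose internals are omitted:

```python
import numpy as np

def encode_multilevel(channels: np.ndarray, pseudo: np.ndarray, layers: list):
    """channels: [N, T, d]; pseudo: [T, d]; each element of `layers` is a
    callable mapping a [T, d] sequence to a [T, d] sequence (one encoder
    layer). After every layer's forward pass, the same-index frames of
    all N channels and the pseudo speech are fused, and the fused
    sequences become the next layer's input."""
    for layer in layers:
        seqs = np.stack([layer(c) for c in channels] + [layer(pseudo)])
        for k in range(seqs.shape[1]):                    # frame-level fusion
            seqs[:, k] = fuse_frames_across_channels(seqs[:, k])
        channels, pseudo = seqs[:-1], seqs[-1]
    return channels, pseudo                               # encoder outputs
```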
In one embodiment, as shown in fig. 4, performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech includes:
S1210, performing feature fusion on the i-th audio frame of each of the N channel audios and the i-th audio frame of the pseudo speech to obtain N+1 fused i-th audio frames;
S1220, ordering the N+1 fused i-th audio frames according to the order of the audio sequences of the N channel audios, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech, respectively; where N and i are positive integers.
Optionally, step S1210 includes: performing feature fusion among the audio frames within each channel audio to obtain a third fused audio sequence of each channel; and performing feature fusion on the i-th frame of each channel's third fused audio sequence and the i-th frame of the pseudo speech to obtain the N+1 fused i-th audio frames.
Specifically, feature fusion between audio frames is computed by the self-attention mechanism. The N channel audios are processed by the attention mechanism; since self-attention does not change the number of frames in a sequence, N third fused audio sequences with the same frame count as the original sequences are output. The i-th frames extracted from the N third fused audio sequences and the i-th frame of the pseudo speech are then processed by the self-attention mechanism, giving the N+1 fused i-th audio frames.
Furthermore, the attention mechanism can also be applied to the audio sequence of the pseudo speech to obtain a third fused audio sequence of the pseudo speech; the i-th frames extracted from the third fused audio sequences of the N channels and the i-th frame of the pseudo speech's third fused audio sequence are then processed by the self-attention mechanism to obtain the N+1 fused i-th audio frames. In this way, the speech to be recognized and the pseudo speech interact within their own audio frames while also interacting across channels, so high-level and low-level semantics are fully fused at the feature level.
Further, each frame in the audio sequence of each channel is processed in the above manner, so N+1 fused results are output for each frame index. Following the order of the audio sequences of the N channel audios, the fused frames are returned in turn to channel 1, channel 2, ..., channel N and the pseudo-speech channel, and each channel's audio sequence is updated to its fused sequence. In this way the ordering of the N+1 fused i-th audio frames is completed, yielding a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech.
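Putting the two phases of step S1210 together, intra-channel self-attention followed by cross-channel fusion at every frame index i, can be sketched as follows (reusing `softmax` and `fuse_frames_across_channels` from the earlier sketch; `self_attend` is a bare-bones, projection-free stand-in for the per-channel attention):

```python
import numpy as np

def self_attend(seq: np.ndarray) -> np.ndarray:
    """Plain temporal self-attention over one [T, d] sequence; the frame
    count is unchanged, matching the third fused audio sequence."""
    d = seq.shape[-1]
    w = softmax(seq @ seq.T / np.sqrt(d), axis=-1)
    return seq + w @ seq

def fuse_step(channels: np.ndarray, pseudo: np.ndarray):
    """channels: [N, T, d]; pseudo: [T, d]. Returns the first fused audio
    sequences (one per channel) and the second fused audio sequence."""
    third = np.stack([self_attend(c) for c in channels]
                     + [self_attend(pseudo)])            # [N+1, T, d]
    for i in range(third.shape[1]):                      # N+1 fused i-th frames
        third[:, i] = fuse_frames_across_channels(third[:, i])
    return third[:-1], third[-1]
```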
For example, suppose the audio sequence of each channel has 3 frames. The audio sequences of channel 1 through channel 4 and the audio sequence corresponding to the pseudo speech are input into the first layer of their respective encoders.
The first layer of the encoder performs self-attention computation on each of these five sequences separately, and outputs a 3-frame third fused audio sequence for each channel and for the pseudo speech.
Self-attention is then computed across the 1st frames of the third fused audio sequences of channel 1, channel 2, channel 3, channel 4, and the pseudo speech, giving the first-fused 1st frame of channels 1-4 and of the pseudo speech. The same computation over the 2nd frames gives the first-fused 2nd frames of channels 1-4 and of the pseudo speech, and over the 3rd frames the first-fused 3rd frames. It will be appreciated that the output fused frames are arranged according to channels 1, 2, 3, 4 and the channel corresponding to the pseudo speech. The five first-fused 1st frames, 2nd frames, and 3rd frames are therefore returned to channels 1-4 and the pseudo-speech channel in the order of the channels' audio sequences, and the first layer of the encoder outputs the first fused audio sequence of each channel and the second fused audio sequence of the pseudo speech.
When the encoder has multiple layers, the fused audio sequences output by the first layer are input to the second layer, and inter-frame feature fusion proceeds in the above manner until the last layer produces its output. The last layer's output is fed to the decoder, which decodes each audio frame in turn until the last frame, yielding the decoding posterior probability of every audio frame output by each decoder; the text of the largest decoding posterior probability is taken as the text of each frame.
Exemplary apparatus
Accordingly, fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application. In an exemplary embodiment, a speech recognition apparatus is provided, comprising:
an obtaining module 510, configured to obtain speech to be recognized and a pseudo speech, where the speech to be recognized is multi-channel audio and the pseudo speech is audio whose sequence length equals that of the multi-channel audio;
a processing module 520, configured to perform feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
and a recognition module 530, configured to decode the first fused audio sequence corresponding to each channel audio and the second fused audio sequence corresponding to the pseudo speech, respectively, and determine the recognition text corresponding to the speech to be recognized.
In one embodiment, the apparatus further comprises:
an audio frame interchange module, configured to interchange the j-th audio frames of the channels of the multi-channel audio to obtain updated multi-channel audio; where j is a positive integer.
In one embodiment, the processing module comprises:
a multi-stage audio encoding module, configured to, in each encoding stage of multi-stage audio encoding, perform feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, respectively, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
wherein the processing result of a first-stage audio encoding serves as the processing object of a second-stage audio encoding, the first-stage and second-stage audio encodings being two adjacent encodings.
In one embodiment, the processing module includes:
a first fusion module, configured to perform feature fusion on the i-th audio frame of each of the N channel audios and the i-th audio frame of the pseudo speech to obtain N+1 fused i-th audio frames;
a second fusion module, configured to order the N+1 fused i-th audio frames according to the order of the audio sequences of the N channel audios, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech, respectively; where N and i are positive integers.
In one embodiment, the first fusion module is further configured to:
perform feature fusion among the audio frames within each channel audio to obtain a third fused audio sequence of each channel;
and perform feature fusion on the i-th frame of each channel's third fused audio sequence and the i-th frame of the pseudo speech to obtain the N+1 fused i-th audio frames.
In one embodiment, the identification module is further configured to:
decode the k-th audio frame in each first fused audio sequence and the k-th audio frame in the second fused audio sequence, respectively, to obtain the decoding posterior probability corresponding to the k-th frame of each first fused audio sequence and the decoding posterior probability corresponding to the k-th frame of the second fused audio sequence;
and determine the target recognition text of the k-th audio frame according to a comparison of the decoding posterior probability corresponding to the k-th frame of each first fused audio sequence with that of the second fused audio sequence; where k is a positive integer.
The speech recognition apparatus provided in this embodiment belongs to the same inventive concept as the speech recognition method provided in the foregoing embodiments of the present application, can execute the speech recognition method provided in any of the foregoing embodiments, and has the functional modules and beneficial effects corresponding to executing that method. For technical details not described in this embodiment, reference may be made to the specific processing of the speech recognition method provided in the foregoing embodiments, which is not repeated here.
The functions performed by the above acquisition module 510, the processing module 520, and the identification module 530 may be implemented by the same or different processors, respectively, and embodiments of the present application are not limited.
It should be appreciated that the modules in the above apparatus may be implemented in the form of software invoked by a processor. For example, the apparatus includes a processor connected to a memory storing instructions; the processor invokes the stored instructions to implement any of the above methods or the functions of the units of the apparatus. Here the processor may be a general-purpose processor, such as a CPU or a microprocessor, and the memory may be inside or outside the apparatus. Alternatively, the units may be implemented as hardware circuits, with some or all unit functions realized through circuit design; the hardware circuit may be understood as one or more processors. For example, in one implementation the hardware circuit is an ASIC whose element logic is designed to realize some or all of the unit functions; in another implementation, the hardware circuit may be a PLD, such as an FPGA containing a large number of logic gates whose interconnections are configured through a configuration file to realize some or all of the unit functions. All units of the above apparatus may be implemented as software invoked by the processor, as hardware circuits, or partly as each.
In an embodiment of the present application, the processor is a circuit with signal processing capability, and in an implementation, the processor may be a circuit with instruction reading and running capability, such as a CPU, a microprocessor, a GPU, or a DSP, etc.; in another implementation, the processor may implement a function through a logical relationship of hardware circuitry that is fixed or reconfigurable, e.g., a hardware circuit implemented by the processor as an ASIC or PLD, such as an FPGA, or the like. In the reconfigurable hardware circuit, the processor loads the configuration document, and the process of implementing the configuration of the hardware circuit may be understood as a process of loading instructions by the processor to implement the functions of some or all of the above units. Furthermore, a hardware circuit designed for artificial intelligence may be provided, which may be understood as an ASIC, such as NPU, TPU, DPU, etc.
It will be seen that each of the units in the above apparatus may be one or more processors (or processing circuits) configured to implement the above method, for example: CPU, GPU, NPU, TPU, DPU, microprocessors, DSP, ASIC, FPGA, or a combination of at least two of these processor forms.
Furthermore, the units in the above apparatus may be integrated together in whole or in part, or may be implemented independently. In one implementation, these units are integrated together and implemented in the form of an SOC. The SOC may include at least one processor for implementing any of the methods above or for implementing the functions of the units of the apparatus, where the at least one processor may be of different types, including, for example, a CPU and an FPGA, a CPU and an artificial intelligence processor, a CPU and a GPU, and the like.
Exemplary electronic device
Another embodiment of the present application also proposes an electronic device, as shown in fig. 6, including:
a memory 600 and a processor 610;
wherein the memory 600 is connected to the processor 610, and is used for storing a program;
the processor 610 is configured to implement the speech recognition method disclosed in any of the above embodiments by executing the program stored in the memory 600.
Specifically, the electronic device may further include: a bus, a communication interface 620, an input device 630, and an output device 640.
The processor 610, the memory 600, the communication interface 620, the input device 630, and the output device 640 are connected to each other by a bus. Wherein:
A bus may comprise a path that communicates information between components of a computer system.
The processor 610 may be a general-purpose processor, such as a central processing unit (CPU) or microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present solution. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA), other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The processor 610 may include a main processor, and may also include a baseband chip, a modem, and the like.
The memory 600 stores a program implementing the technical solution of the present application, and may also store an operating system and other critical services. Specifically, the program may include program code comprising computer operating instructions. More specifically, the memory 600 may include read-only memory (ROM), other types of static storage devices that can store static information and instructions, random access memory (RAM), other types of dynamic storage devices that can store information and instructions, disk storage, flash memory, and the like.
The input device 630 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input means, touch screen, pedometer, or gravity sensor, among others.
Output device 640 may include means such as a display screen, printer, speakers, etc. that allow information to be output to a user.
The communication interface 620 may include a device using any transceiver or the like for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The processor 610 executes programs stored in the memory 600 and invokes other devices that may be used to implement the various steps of the speech recognition method provided by any of the above-described embodiments of the present application.
Exemplary computer program product and storage Medium
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in a speech recognition method according to various embodiments of the application described in the "exemplary methods" section of this specification.
The computer program product may carry program code for performing the operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented languages such as Java and C++ as well as conventional procedural languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a storage medium on which a computer program is stored, the computer program being executed by a processor to perform the steps in the speech recognition method according to the various embodiments of the present application described in the "Exemplary method" section of this specification. For the details of the electronic device described above, of the computer program product, and of the computer program on the storage medium when executed by a processor, reference may be made to the method embodiments above, which are not repeated here.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described as different from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the apparatus class embodiments, the description is relatively simple as it is substantially similar to the method embodiments, and reference is made to the description of the method embodiments for relevant points.
The steps in the method of each embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.
The modules and the submodules in the device and the terminal of the embodiments of the application can be combined, divided and deleted according to actual needs.
In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of modules or sub-modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software elements may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech recognition, comprising:
obtaining speech to be recognized and a pseudo speech, wherein the speech to be recognized is multi-channel audio and the pseudo speech is audio whose sequence length equals that of the multi-channel audio;
performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
and decoding the first fused audio sequence corresponding to each channel audio and the second fused audio sequence corresponding to the pseudo speech, respectively, and determining the recognition text corresponding to the speech to be recognized.
2. The method of claim 1, wherein after the obtaining the speech to be recognized and the pseudo speech, the method further comprises:
interchanging the j-th audio frames of the channels of the multi-channel audio to obtain updated multi-channel audio; wherein j is a positive integer.
3. The method according to claim 1 or 2, wherein performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech comprises:
in each encoding stage of multi-stage audio encoding, performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, respectively, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
wherein the processing result of a first-stage audio encoding serves as the processing object of a second-stage audio encoding, the first-stage and second-stage audio encodings being two adjacent encodings.
4. The method according to claim 3, wherein performing feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech comprises:
performing feature fusion on the i-th audio frame of each of the N channel audios and the i-th audio frame of the pseudo speech to obtain N+1 fused i-th audio frames;
ordering the N+1 fused i-th audio frames according to the order of the audio sequences of the N channel audios, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech, respectively; wherein N and i are positive integers.
5. The method of claim 4, wherein performing feature fusion on the i-th audio frame of the N channel audios and the i-th audio frame of the pseudo speech to obtain N+1 fused i-th audio frames comprises:
performing feature fusion among the audio frames within each channel audio to obtain a third fused audio sequence of each channel;
and performing feature fusion on the i-th frame of each channel's third fused audio sequence and the i-th frame of the pseudo speech to obtain the N+1 fused i-th audio frames.
6. The method of claim 1, wherein decoding the first fused audio sequence corresponding to each channel audio and the second fused audio sequence corresponding to the pseudo speech, respectively, and determining the recognition text corresponding to the speech to be recognized comprises:
decoding the k-th audio frame in each first fused audio sequence and the k-th audio frame in the second fused audio sequence, respectively, to obtain the decoding posterior probability corresponding to the k-th frame of each first fused audio sequence and the decoding posterior probability corresponding to the k-th frame of the second fused audio sequence;
determining the target recognition text of the k-th audio frame according to a comparison of the decoding posterior probability corresponding to the k-th frame of each first fused audio sequence with that of the second fused audio sequence; wherein k is a positive integer.
7. A speech recognition apparatus, comprising:
an obtaining module, configured to obtain speech to be recognized and a pseudo speech, wherein the speech to be recognized is multi-channel audio and the pseudo speech is audio whose sequence length equals that of the multi-channel audio;
a processing module, configured to perform feature fusion between each audio frame in each channel audio of the speech to be recognized and each audio frame of the pseudo speech, to obtain a first fused audio sequence corresponding to each channel audio and a second fused audio sequence corresponding to the pseudo speech;
and a recognition module, configured to decode the first fused audio sequence corresponding to each channel audio and the second fused audio sequence corresponding to the pseudo speech, respectively, and determine the recognition text corresponding to the speech to be recognized.
8. An electronic device, comprising:
A memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor implements the speech recognition method according to any one of claims 1 to 6 by running a program in the memory.
9. A storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 6.
CN202410165122.XA 2024-02-05 2024-02-05 Speech recognition method, apparatus, device, storage medium, and program product Pending CN118173095A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410165122.XA CN118173095A (en) 2024-02-05 2024-02-05 Speech recognition method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN118173095A 2024-06-11

Family

ID=91349588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410165122.XA Pending CN118173095A (en) 2024-02-05 2024-02-05 Speech recognition method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN118173095A (en)


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination