CN116844534A - Voice recognition method and device

Voice recognition method and device

Info

Publication number: CN116844534A
Application number: CN202310300312.3A
Authority: CN (China)
Prior art keywords: voice, frames, speech, preset, recognition
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 李思琪, 付立
Assignee (current and original): Jingdong Technology Information Technology Co Ltd
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202310300312.3A
Publication of CN116844534A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems


Abstract

The invention discloses a voice recognition method and device, relating to the technical field of artificial intelligence. One embodiment of the method comprises the following steps: dividing voice data received in a voice recognition scene into a plurality of voice data blocks; adding a plurality of padding voice frames to each voice data block; extracting, from each padded voice data block, voice frame groups each containing a set number of voice frames; and processing the voice frames in each input voice frame group with a preset voice recognition model to obtain a recognition result, wherein the number of convolution kernels contained in the preset voice recognition model is consistent with the set number. This embodiment solves the prior-art problem of low voice recognition accuracy caused by failing to consider the relevance between the voice information of the current frame and the voice frames after it within a voice data block, and improves the voice recognition effect.

Description

Voice recognition method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for voice recognition.
Background
With the development of deep learning for artificial intelligence in recent years, neural-network-based voice recognition systems have improved greatly. Voice recognition is a technique that converts a speech sequence into a corresponding text sequence. Streaming voice recognition, in particular, is well suited to scenes that need to acquire recognition results in real time (such as live captions, real-time meeting records, voice input, and voice wakeup).
Current streaming voice recognition models use causal convolution: when computing a voice frame contained in a voice data block, only the current voice frame and the frames before it in the time series are used, and the frames after the current frame within the same voice data block cannot be used. Existing causal-convolution-based voice recognition therefore fails to consider the relevance between the voice information of the current frame and the voice frames after it within a voice data block, resulting in a poor recognition effect and low recognition accuracy.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and apparatus for speech recognition that can divide voice data received in a voice recognition scene into a plurality of voice data blocks, add a plurality of padding voice frames to each voice data block, extract voice frame groups each containing a set number of voice frames from the padded voice data blocks, and process the voice frames in each input voice frame group with a preset voice recognition model to obtain a recognition result, wherein the number of convolution kernels contained in the preset voice recognition model is consistent with the set number. The embodiments of the present invention solve the prior-art problem of low voice recognition accuracy caused by failing to consider the relevance between the voice information of the current frame and the voice frames after it within a voice data block, and improve the voice recognition effect.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a method for speech recognition, including: receiving voice data to be recognized in response to triggering voice recognition, and dividing the voice data into a plurality of voice data blocks, wherein the voice data blocks comprise a plurality of initial voice frames with continuous time sequences; for each of the blocks of speech data, sequentially performing: adding a plurality of padding speech frames to the speech data block; extracting a plurality of groups of voice frame groups from the voice data block according to a set step length, wherein the total number of initial voice frames and filling voice frames contained in each group of voice frame groups is a set number, and the set step length indicates the position difference between extraction positions of every two adjacent groups of voice frame groups; according to a time sequence, inputting an initial voice frame and a filling voice frame contained in each voice frame group into a preset voice recognition model for each voice frame group to obtain a voice recognition result of the voice frame group; the number of convolution kernels contained in the preset voice recognition model is the set number; obtaining a recognition result aiming at the voice data block according to the voice recognition result corresponding to each voice frame group; and determining a target recognition result of the voice data according to the recognition result of each recognized voice data block and sending the target recognition result.
Optionally, adding a plurality of padding voice frames to the voice data block includes: acquiring the time series of the plurality of initial voice frames contained in the voice data block; taking the initial voice frame that is first in the time series as a first voice frame and the initial voice frame that is last in the time series as a second voice frame; and adding a plurality of padding voice frames before the first voice frame and after the second voice frame, respectively.
Optionally, the method for voice recognition further comprises: determining the set step length according to the number of initial voice frames and the number of padding voice frames contained in the voice data block, so that the number of initial voice frames contained in the voice data block is consistent with the number of groups of voice frames.
Optionally, in the method for voice recognition, each of the voice frame groups contains all the initial voice frames; and the target voice frame at the set position in each voice frame group is the main influencing factor of the preset voice recognition model, while the other voice frames outside the set position are the associated influencing factors of the preset voice recognition model.
Optionally, the method for voice recognition further comprises: recognizing the target voice frame at the set position in the voice frame group through the voice recognition model, and taking the voice recognition result of the voice frame group as the voice recognition result of the target voice frame.
Optionally, the preset speech recognition model includes: presetting an encoder; wherein the preset encoder comprises the set number of convolution kernels; the inputting the initial voice frame and the filling voice frame contained in the voice frame group into a preset voice recognition model comprises the following steps: inputting the initial voice frames and the filling voice frames contained in the voice frame group into the preset encoder, executing convolution operation on the initial voice frames and the filling voice frames contained in the voice frame group by utilizing the convolution kernels of the set quantity contained in the preset encoder, and outputting the characteristics of the voice frame group according to the convolution operation result.
Optionally, the preset speech recognition model further includes: a time sequence classification model, an attention decoder; inputting the initial voice frame and the filling voice frame contained in the voice frame group into a preset voice recognition model to obtain a voice recognition result of the voice frame group, wherein the voice recognition result comprises the following steps: inputting the characteristics of the voice frame group into the time sequence classification model, and acquiring a first text characteristic corresponding to the voice frame group output by the time sequence classification model; and inputting the characteristics of the voice frame group and the first text characteristics of the voice frame group into the attention decoder to obtain a voice recognition result of the voice frame group.
Optionally, the objective function of the time sequence classification model and the objective function of the attention decoder are superimposed to obtain a model objective function of the preset voice recognition model; the preset voice recognition model is trained; a training result of the preset voice recognition model is evaluated by using the model objective function, the time sequence classification model and/or the attention decoder are adjusted according to the training result, and the weight of the objective function of the time sequence classification model or of the objective function of the attention decoder in the model objective function is adjusted.
To achieve the above object, according to a second aspect of the embodiments of the present invention, there is provided a device for speech recognition, including: a voice data processing module, a voice data recognition module, and a recognition result sending module; wherein,
the voice data processing module is used for responding to triggering voice recognition, receiving voice data to be recognized and dividing the voice data into a plurality of voice data blocks, wherein the voice data blocks comprise a plurality of initial voice frames with continuous time sequences;
the voice data recognition module is configured to, for each of the voice data blocks, sequentially perform: adding a plurality of padding speech frames to the speech data block; extracting a plurality of groups of voice frame groups from the voice data block according to a set step length, wherein the total number of initial voice frames and filling voice frames contained in each group of voice frame groups is a set number, and the set step length indicates the position difference between extraction positions of every two adjacent groups of voice frame groups; according to a time sequence, inputting an initial voice frame and a filling voice frame contained in each voice frame group into a preset voice recognition model for each voice frame group to obtain a voice recognition result of the voice frame group; the number of convolution kernels contained in the preset voice recognition model is the set number;
the recognition result sending module is used for obtaining a recognition result for the voice data block according to the voice recognition result corresponding to each voice frame group; and determining a target recognition result of the voice data according to the recognition results of the recognized voice data blocks and sending the target recognition result.
Optionally, the voice recognition device is configured to add a plurality of padding voice frames to the voice data block, including: acquiring the time series of the plurality of initial voice frames contained in the voice data block; taking the initial voice frame that is first in the time series as a first voice frame and the initial voice frame that is last in the time series as a second voice frame; and adding a plurality of padding voice frames before the first voice frame and after the second voice frame, respectively.
Optionally, the voice recognition device is further configured to determine the set step length according to the number of initial voice frames contained in the voice data block and the number of padding voice frames, so that the number of initial voice frames contained in the voice data block is consistent with the group number of the voice frame groups.
Optionally, in the voice recognition device, each of the voice frame groups contains all the initial voice frames; and the target voice frame at the set position in each voice frame group is the main influencing factor of the preset voice recognition model, while the other voice frames outside the set position are the associated influencing factors of the preset voice recognition model.
Optionally, the voice recognition device is further configured to recognize a target voice frame at a set position in the voice frame group through the voice recognition model, and use a voice recognition result of the voice frame group as a voice recognition result of the target voice frame.
Optionally, in the voice recognition device, the preset voice recognition model includes: a preset encoder, wherein the preset encoder contains the set number of convolution kernels; and inputting the initial voice frames and the padding voice frames contained in the voice frame group into the preset voice recognition model includes: inputting the initial voice frames and the padding voice frames contained in the voice frame group into the preset encoder, performing a convolution operation on them by using the set number of convolution kernels contained in the preset encoder, and outputting the features of the voice frame group according to the result of the convolution operation.
Optionally, in the voice recognition device, the preset voice recognition model further includes: a time sequence classification model and an attention decoder; and inputting the initial voice frames and the padding voice frames contained in the voice frame group into the preset voice recognition model to obtain the voice recognition result of the voice frame group includes: inputting the features of the voice frame group into the time sequence classification model, and acquiring the first text features corresponding to the voice frame group output by the time sequence classification model; and inputting the features of the voice frame group and the first text features of the voice frame group into the attention decoder to obtain the voice recognition result of the voice frame group.
Optionally, the voice recognition device is configured to superimpose the objective function of the time sequence classification model and the objective function of the attention decoder to obtain a model objective function of the preset voice recognition model; training the preset voice recognition model; and evaluating a training result of a preset voice recognition model by using the model objective function, adjusting the time sequence classification model and/or the attention decoder according to the training result, and adjusting the weight of the objective function of the time sequence classification model or the objective function of the attention decoder in the model objective function.
To achieve the above object, according to a third aspect of an embodiment of the present invention, there is provided an electronic device for voice recognition, including: one or more processors; and a storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the methods of speech recognition above.
To achieve the above object, according to a fourth aspect of embodiments of the present invention, there is provided a computer-readable medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method as described in any one of the above-described methods of speech recognition.
One embodiment of the above invention has the following advantages or benefits: voice data received in a voice recognition scene can be divided into a plurality of voice data blocks; a plurality of padding voice frames are added to each voice data block; voice frame groups containing a set number of voice frames are extracted from the padded voice data blocks; and the voice frames in each input voice frame group are processed by a preset voice recognition model to obtain a recognition result, the number of convolution kernels contained in the preset voice recognition model being consistent with the set number. The embodiment of the present invention thereby solves the prior-art problem of low voice recognition accuracy caused by failing to consider the relevance between the voice information of the current frame and the voice frames after it within a voice data block, and improves the voice recognition effect.
Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a flow chart of a method for speech recognition according to an embodiment of the present invention;
FIG. 2A is a schematic diagram of a flow of processing a block of speech data in a causal convolution of the prior art;
FIG. 2B is a schematic diagram of a flow for processing a block of speech data in a non-causal convolution provided by one embodiment of the present invention;
FIG. 3 is a schematic diagram of a preset speech recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a voice recognition apparatus according to an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
fig. 6 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical scheme of the invention, the aspects of the related personal information of the user, such as acquisition, collection, updating, analysis, processing, use, transmission, storage and the like, all conform to the rules of related laws and regulations, are used for legal purposes, and do not violate the popular public order. Necessary measures are taken for the personal information of the user, illegal access to the personal information data of the user is prevented, and the personal information security, network security and national security of the user are maintained.
After receiving the voice data, de-identification processing is carried out on the data by technical means, and when the voice recognition result information is displayed, the information is desensitized by adopting a de-identification or anonymization processing mode so as to protect the information security.
As shown in fig. 1, an embodiment of the present invention provides a method for speech recognition, which may include the following steps:
step S101: in response to triggering voice recognition, receiving voice data to be recognized, and dividing the voice data into a plurality of voice data blocks, wherein the voice data blocks comprise a plurality of initial voice frames which are continuous in time sequence.
Specifically, voice recognition is a technique for converting a speech sequence into a corresponding text sequence, and is classified into non-streaming voice recognition and streaming voice recognition according to the manner in which results are returned. Streaming voice recognition returns recognition results in real time while processing the audio stream; it is therefore better suited to scenes that need to acquire recognition results in real time, such as live real-time captions, real-time meeting records, voice input, and voice wakeup.
The embodiment of the present invention is applied to streaming voice recognition and can serve voice recognition triggered by the different application scenes contained in different application clients (or servers). For example, for a client containing a live real-time caption scene, voice recognition can be triggered whenever new live voice data is generated; for a client containing a voice input scene, voice recognition can be triggered when the start of speech is detected and voice data is generated.
Further, in the process of performing voice recognition, the received voice data is first divided into a plurality of voice data blocks, that is, into a plurality of chunks, and recognition is then performed chunk by chunk. In the description of the present invention, "chunk" and "voice data block" both refer to the blocks into which the voice data to be recognized is divided.
Since the voice data has a time series characteristic, the voice frames in the voice data block also have a time series characteristic, namely, the voice data block comprises a plurality of initial voice frames which are continuous in time series; for example, a block of speech data (i.e., a chunk) contains 4 initial speech frames denoted as [1,2,3,4], where the order of the 4 initial speech frames is sequential in time series.
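To make this concrete, the chunking step can be sketched as follows (illustrative only: the patent does not prescribe an implementation, the chunk size of 4 and the integer frame labels are assumptions, and real frames would be acoustic feature vectors):

```python
from typing import List

def split_into_chunks(frames: List[int], chunk_size: int = 4) -> List[List[int]]:
    """Divide a time-ordered sequence of voice frames into voice data blocks (chunks)."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]

# 12 frames of received voice data -> 3 chunks of time-series-continuous frames
print(split_into_chunks(list(range(1, 13))))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```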
Step S102: step S103-step S105 are sequentially performed for each of the voice data blocks.
Specifically, after a plurality of voice data blocks are divided, the operations of steps S103 to S105 are performed for each voice data block in time order (i.e., sequentially).
Step S103: a plurality of padded speech frames are added to the speech data block.
Specifically, the convolution operation in the preset speech recognition model used in the embodiment of the present invention is a non-causal convolution operation; the differences between non-causal convolution operations and causal convolution operations, and how a plurality of padding voice frames are added to the voice data block, are described below in connection with fig. 2A and fig. 2B.
As shown in fig. 2A, in the causal convolution method, the convolution calculation uses only the current voice frame and preceding voice information. The lower rectangular boxes in fig. 2A represent the input voice frames to be convolved, for example 9 voice frames, and the voice data blocks associated with these 9 frames are, for example, 3 chunks (chunk1, chunk2, chunk3); the upper rectangular boxes in fig. 2A represent the results of the convolution calculation. The causal convolution output for each voice frame is obtained by convolving the input of the current voice frame with the inputs of the 3 preceding voice frames; for the first 3 voice frames, a padding operation (e.g., supplementing 3 zero-valued voice frames) is required before the input sequence so that the causal convolution can be computed. The causal convolution method can therefore see only frame 1 when computing frame 1, but not frames 2, 3, and 4; it can see only frames 1 and 2 when computing frame 2, but not frames 3 and 4; and so on. That is, the causal convolution method cannot utilize the future information of the current voice frame within the same chunk, so streaming voice recognition suffers from low accuracy, particularly for the sentence ends of the voice data.
In view of this, the embodiment of the present invention uses a non-causal convolution method, which can use future voice information within the chunk to improve the voice recognition effect and thus the recognition accuracy. Specifically, as shown in fig. 2B, the rectangular boxes at the bottom of fig. 2B represent the input voice frames to be convolved, and the rectangular boxes at the top represent the results of the convolution calculation. For example, chunk1 contains 4 voice frames; three voice frames may be padded before and after chunk1, and the result is then convolved. That is, a plurality of padding voice frames are added to the voice data block, where the padding voice frames may be 0-valued voice frames. Further, adding a plurality of padding voice frames to the voice data block includes: acquiring the time series of the plurality of initial voice frames contained in the voice data block; taking the initial voice frame that is first in the time series as a first voice frame and the initial voice frame that is last in the time series as a second voice frame; and adding a plurality of padding voice frames before the first voice frame and after the second voice frame, respectively. As shown in fig. 2B, chunk1 contains 4 initial voice frames [1,2,3,4]: "1" is the first voice frame, i.e., the initial voice frame first in the time series, and "4" is the second voice frame, i.e., the initial voice frame last in the time series. Adding 3 padding voice frames before "1" and 3 after "4" yields 10 voice frames; that is, a plurality of padding voice frames are added to the voice data block to obtain a voice data block to be processed, e.g., the padded voice data block is denoted [0,0,0,1,2,3,4,0,0,0], which is subsequently convolved. The embodiment of the present invention thus adopts a non-causal convolution method and performs zero padding for each chunk (padding with zeros being the addition of the plurality of padding voice frames), so that stacking convolution layers over multiple voice data blocks does not break the streaming recognition experience, and the voice information of frames at future times within the same chunk can be utilized to improve the streaming voice recognition effect.
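A minimal sketch of this padding step (assuming, as in the figure, integer frame labels, zero-valued padding frames, and 3 padding frames on each side; the helper name is hypothetical):

```python
def pad_chunk(chunk, num_pad: int = 3):
    """Add `num_pad` 0-valued padding voice frames before and after the voice data block."""
    return [0] * num_pad + list(chunk) + [0] * num_pad

print(pad_chunk([1, 2, 3, 4]))
# [0, 0, 0, 1, 2, 3, 4, 0, 0, 0]
```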
Step S104: and extracting a plurality of groups of voice frame groups from the voice data block according to a set step length, wherein the total number of initial voice frames and filling voice frames contained in each group of voice frame groups is a set number, and the set step length indicates the position difference between the extraction positions of every two adjacent groups of voice frame groups.
Specifically, the voice frames contained in the padded voice data block obtained in step S103 are represented as [0,0,0,1,2,3,4,0,0,0].
Further, a plurality of groups of voice frames are extracted according to a set step length. For example, if the set step length is 1, the extracted voice frame groups are: [0,0,0,1,2,3,4], [0,0,1,2,3,4,0], [0,1,2,3,4,0,0], [1,2,3,4,0,0,0]. Each voice frame group contains 7 voice frames (7 being the set number), namely the 4 initial voice frames (1, 2, 3, 4) and 3 padding voice frames (zeros); that is, the total number of initial voice frames and padding voice frames contained in each voice frame group is the set number. When extracting the voice frame groups, each successive group is obtained by shifting by the set step length along the time series, i.e., the set step length indicates the position difference between the extraction positions of every two adjacent voice frame groups. For example, for two adjacent extracted groups [0,0,0,1,2,3,4] and [0,0,1,2,3,4,0] taken from [0,0,0,1,2,3,4,0,0,0], the position difference of the first voice frame between the two groups is 1, and 1 is the set step length.
Further, the set step length is determined according to the number of initial voice frames contained in the voice data block and the number of padding voice frames, so that the number of initial voice frames contained in the voice data block is consistent with the number of groups of voice frames. For example, the 4 initial voice frames contained in the voice data block are denoted [1,2,3,4], and all 4 initial voice frames need to be processed during voice recognition. A corresponding voice frame group is extracted for each initial voice frame; that is, 4 voice frame groups (each containing 7 voice frames, including all 4 initial voice frames) need to be extracted from [0,0,0,1,2,3,4,0,0,0]. By determining the set step length to be 1, the number of initial voice frames contained in the voice data block (4) is made consistent with the number of voice frame groups (4), i.e., each voice frame group contains all the initial voice frames, and the operation of inputting a preset voice recognition model is then performed.
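The extraction of voice frame groups can be sketched as a sliding window under the numbers of this example (set number 7, set step length 1; the function name is an illustrative assumption):

```python
def extract_frame_groups(padded_chunk, group_size: int = 7, step: int = 1):
    """Extract voice frame groups of `group_size` frames, shifting by `step` each time."""
    last_start = len(padded_chunk) - group_size
    return [padded_chunk[i:i + group_size] for i in range(0, last_start + 1, step)]

padded = [0, 0, 0, 1, 2, 3, 4, 0, 0, 0]
for group in extract_frame_groups(padded):
    print(group)
# [0, 0, 0, 1, 2, 3, 4]   -> target voice frame "1" at position 4
# [0, 0, 1, 2, 3, 4, 0]   -> target voice frame "2" at position 4
# [0, 1, 2, 3, 4, 0, 0]   -> target voice frame "3" at position 4
# [1, 2, 3, 4, 0, 0, 0]   -> target voice frame "4" at position 4
```

Note that 4 groups are produced, one per initial voice frame, matching the step-length choice above.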
In the embodiment of the present invention, by extracting a plurality of voice frame groups, the information of the voice frames after a current voice frame can be associated when that frame is computed. For example, when the first voice frame "1" is computed, the input participating in the computation is the voice frame group [0,0,0,1,2,3,4], and the voice recognition result of that group is taken as the voice recognition result of the target voice frame "1"; "1" is the target voice frame, i.e., the main influencing factor input into the preset voice recognition model. The position of "1" in the voice frame group is 4 (i.e., the set position), and the other voice frames outside position 4 are the associated influencing factors of the preset voice recognition model when computing "1". Similarly, the second voice frame is computed from the voice frame group [0,0,1,2,3,4,0]; "2" is the target voice frame, i.e., the main influencing factor of the preset voice recognition model, and its position in the group is again 4 (the set position); and so on. In this way, frames 2, 3, and 4 can be associated when the first voice frame is computed, frames 3 and 4 when the second voice frame is computed, and so forth, thereby improving the voice recognition effect and accuracy. That is, each voice frame group contains all the initial voice frames; the target voice frame at the set position in each voice frame group is the main influencing factor of the voice recognition model, and the other voice frames outside the set position are the associated influencing factors of the voice recognition model. The target voice frame at the set position in a voice frame group is recognized by the voice recognition model, and the voice recognition result of the voice frame group is taken as the voice recognition result of that target voice frame.
Step S105: according to a time sequence, inputting an initial voice frame and a filling voice frame contained in each voice frame group into a preset voice recognition model for each voice frame group to obtain a voice recognition result of the voice frame group; the number of convolution kernels contained in the preset voice recognition model is the set number; and obtaining a recognition result aiming at the voice data block according to the voice recognition result corresponding to each voice frame group.
In the embodiment of the present invention, the number of voice frames contained in each voice frame group is 7, and the number of convolution kernels in the preset voice recognition model is also 7; that is, the preset voice recognition model includes a preset encoder, and the preset encoder contains the set number of convolution kernels. By processing the same number of voice frames with a matching number of convolution kernels, the benefit of performing a non-causal convolution operation on the voice frames of each voice frame group through the preset voice recognition model is obtained. The number of convolution kernels adopted in the prior-art causal convolution method is 4; it can be seen that the embodiment of the present invention implements the non-causal convolution operation by adding a plurality of padding voice frames to the voice data block and correspondingly expanding the number of convolution kernels, thereby improving the voice recognition effect and accuracy.
That is, the preset speech recognition model includes: presetting an encoder; wherein the preset encoder comprises the set number of convolution kernels; the inputting the initial voice frame and the filling voice frame contained in the voice frame group into a preset voice recognition model comprises the following steps: inputting the initial voice frames and the filling voice frames contained in the voice frame group into the preset encoder, executing convolution operation on the initial voice frames and the filling voice frames contained in the voice frame group by utilizing the convolution kernels of the set quantity contained in the preset encoder, and outputting the characteristics of the voice frame group according to the convolution operation result.
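One way this convolution could be realized is sketched below with PyTorch (an interpretation, not the patent's implementation: the "set number of convolution kernels" is read here as a convolution whose receptive field spans the 7 frames of a voice frame group, and the 80-dimensional frame features are invented for the example):

```python
import torch
import torch.nn as nn

feat_dim = 80                      # assumed per-frame feature size
conv = nn.Conv1d(in_channels=feat_dim, out_channels=feat_dim,
                 kernel_size=7)    # receptive field = set number of frames

chunk = torch.randn(1, feat_dim, 4)        # 4 initial frames: (batch, feat, time)
padded = nn.functional.pad(chunk, (3, 3))  # 3 zero-valued frames before and after
out = conv(padded)                         # one non-causal output per initial frame
print(out.shape)                           # torch.Size([1, 80, 4])
```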
The following describes a preset speech recognition model according to an embodiment of the present invention with reference to fig. 3; fig. 3 is a schematic structural diagram of a preset speech recognition model according to an embodiment of the present invention, where, as shown in fig. 3, the preset speech recognition model according to an embodiment of the present invention is used as a multitasking model framework, and includes a preset encoder, and further includes: a time sequence classification model, an attention decoder; wherein the preset encoder comprises 7 convolution kernels, i.e. the preset encoder comprises the set number of convolution kernels.
Further, as shown in fig. 3, inputting the initial voice frames and padding voice frames contained in a voice frame group into the preset voice recognition model to obtain the voice recognition result of the voice frame group includes: inputting the features of the voice frame group into the time sequence classification model, and acquiring the first text features corresponding to the voice frame group output by the time sequence classification model; and inputting the features of the voice frame group and the first text features of the voice frame group into the attention decoder to obtain the voice recognition result of the voice frame group. Specifically, in processing the feature data of the voice frames, the voice frame group to be recognized (for example, represented as [0,0,0,1,2,3,4]) is first input into the preset encoder, so that deep feature information of the voice frames to be recognized is extracted through the operation steps of the preset encoder's convolution. The features of the voice frame group output by the preset encoder are then input into the time sequence classification model, and the first text features corresponding to the voice frame group output by the time sequence classification model are acquired, where the first text features may be the text labels recognized from the voice frame group by the time sequence classification model, the probabilities of those text labels, and the like. Further, the features of the voice frame group and the first text features of the voice frame group are combined and input into the attention decoder, and the voice recognition result of the voice frame group is obtained through further processing.
In one embodiment of the present invention, the structure of the preset speech recognition model may be a CTC/Attention network structure based on a Conformer encoder, where the time sequence classification model is, for example, CTC (Connectionist Temporal Classification), the attention decoder is, for example, an Attention decoder, and the preset encoder may be a Conformer encoder, i.e., a speech recognition encoder based on the Conformer (Convolution-augmented Transformer for Speech Recognition) structure. The Conformer encoder may comprise a neural-network convolution layer (containing the set number of convolution kernels). The order in which the encoder processes the voice features is, for example: spectrum augmentation is first applied to the input voice data, followed by the neural-network convolution; after a linear layer and a random-inactivation (dropout) layer, the features enter N stacked Conformer modules. Each Conformer module may further include a feed-forward network 1, a self-attention module, a convolutional neural network module, a feed-forward network 2, and the like. The Conformer thus obtains both global and local modeling capability over the voice features through the self-attention mechanism and the convolutional neural network.
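The overall data flow of fig. 3 can be sketched structurally as follows (a toy stand-in, not the patented model: a GRU replaces the Conformer encoder, single linear layers replace the CTC branch and the attention decoder, and all dimensions are assumptions):

```python
import torch
import torch.nn as nn

class HybridRecognizer(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # stand-in for the Conformer encoder
        self.ctc_head = nn.Linear(hidden, vocab)                   # time sequence classification (CTC) branch
        self.att_decoder = nn.Linear(hidden + vocab, vocab)        # stand-in for the attention decoder

    def forward(self, frame_group):
        feats, _ = self.encoder(frame_group)        # deep features of the voice frame group
        ctc_logits = self.ctc_head(feats)           # first text features (labels/probabilities)
        fused = torch.cat([feats, ctc_logits], -1)  # combine features with first text features
        return self.att_decoder(fused)              # recognition result of the voice frame group

model = HybridRecognizer()
group = torch.randn(1, 7, 80)   # one voice frame group: 7 frames of 80-dim features
print(model(group).shape)       # torch.Size([1, 7, 1000])
```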
Preferably, iterative training is performed on the speech recognition model shown in fig. 3 provided by the embodiment of the present invention, and the trained model is used as the preset speech recognition model through which the speech recognition operations are performed. Specifically, the time sequence classification model has, for example, an objective function 1 and the attention decoder an objective function 2; the speech recognition model in the embodiment of the present invention superimposes objective function 1 of the time sequence classification model and objective function 2 of the attention decoder to obtain objective function 3 (i.e., the model objective function) of the speech recognition model, as shown in formula (1):
L_MOL = λ · log P_ctc(C|X) + (1 - λ) · log P_att(C|X)    (1)
where L_MOL denotes objective function 3, log P_ctc(C|X) denotes objective function 1 of the time sequence classification model, log P_att(C|X) denotes objective function 2 of the attention decoder, X denotes the input voice features, and C denotes the corresponding text label. The models corresponding to the objective functions of the time sequence classification model and the attention decoder can thus be trained over multiple iterations, the training effect evaluated, and the weight λ adjusted to achieve the best recognition effect of the whole speech recognition model; that is, the training result of the preset speech recognition model is evaluated by using the model objective function, and the time sequence classification model and/or the attention decoder are adjusted according to the training result.
It can be understood that the value range of λ is [0,1]; according to the evaluation of the training results, λ can be, for example, 0.3 or 0.5. That is, the objective function of the time sequence classification model and the objective function of the attention decoder are superimposed to obtain the model objective function of the preset voice recognition model; the preset voice recognition model is trained; the training result of the preset voice recognition model is evaluated by using the model objective function, the time sequence classification model and/or the attention decoder are adjusted according to the training result, and the weight of the objective function of the time sequence classification model or of the objective function of the attention decoder in the model objective function is adjusted.
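A sketch of training with the joint objective of formula (1) (assuming PyTorch's built-in CTC and cross-entropy losses as stand-ins for objective functions 1 and 2; all shapes and λ = 0.3 are illustrative; PyTorch losses are negative log-likelihoods, so minimizing this loss maximizes formula (1)):

```python
import torch
import torch.nn.functional as F

def joint_loss(ctc_logits, att_logits, targets, input_lens, target_lens, lam=0.3):
    """L_MOL-style loss: lam * L_ctc + (1 - lam) * L_att, per formula (1)."""
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)  # (T, N, vocab) for CTC
    l_ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    l_att = F.cross_entropy(att_logits.transpose(1, 2), targets)   # (N, vocab, S) vs (N, S)
    return lam * l_ctc + (1 - lam) * l_att

N, T, S, V = 2, 7, 3, 1000
ctc_logits = torch.randn(N, T, V, requires_grad=True)  # per-frame CTC scores
att_logits = torch.randn(N, S, V, requires_grad=True)  # per-token decoder scores
targets = torch.randint(1, V, (N, S))                  # text labels C (0 = CTC blank)
loss = joint_loss(ctc_logits, att_logits, targets,
                  torch.full((N,), T), torch.full((N,), S))
loss.backward()
```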
Step S106: and determining a target recognition result of the voice data according to the recognition result of each recognized voice data block and sending the target recognition result.
Specifically, after the voice recognition operation is performed on each voice data block (e.g., each chunk), the corresponding text is obtained; then, according to the time sequence, the target text (target recognition result) of the voice data composed of these voice data blocks is obtained from the recognition result of each voice data block, and the result is sent to the requesting end (for example, a client or a server) that requested the voice recognition, such as a client containing live real-time caption, real-time meeting record, voice input, or voice wakeup scenes.
As can be seen from the description of step S101 to step S106, in one embodiment of the present invention, when a preset encoder performs convolution operation, input voice data is divided separately according to voice data blocks (chunk), and then each chunk is sequentially subjected to non-causal convolution, so that future information of a voice frame in the chunk can be utilized to obtain a better stream recognition effect; the embodiment of the invention utilizes a non-causal convolution method to replace a causal convolution method under the condition of meeting the streaming identification requirement, so that a voice identification model can utilize future voice information in a time sequence to obtain a better streaming identification effect, and the accuracy of voice identification is improved.
As shown in fig. 4, an embodiment of the present invention provides a device 400 for speech recognition, including: a process voice data module 401, a recognize voice data module 402, and a send recognition result module 403; wherein,
the voice data processing module 401 is configured to receive voice data to be recognized in response to triggering voice recognition, and divide the voice data into a plurality of voice data blocks, where the voice data blocks include a plurality of initial voice frames that are continuous in time sequence;
The recognition voice data module 402 is configured to, for each of the voice data blocks, sequentially perform: adding a plurality of padding speech frames to the speech data block; extracting a plurality of groups of voice frame groups from the voice data block according to a set step length, wherein the total number of initial voice frames and filling voice frames contained in each group of voice frame groups is a set number, and the set step length indicates the position difference between extraction positions of every two adjacent groups of voice frame groups; according to a time sequence, inputting an initial voice frame and a filling voice frame contained in each voice frame group into a preset voice recognition model for each voice frame group to obtain a voice recognition result of the voice frame group; the number of convolution kernels contained in the preset voice recognition model is the set number;
the sending recognition result module 403 is configured to obtain a recognition result for the voice data block according to a voice recognition result corresponding to each voice frame group; and determining a target recognition result of the voice data according to the recognition result of each recognized voice data block and sending the target recognition result.
The embodiment of the invention also provides electronic equipment for voice recognition, which comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method provided by any of the embodiments described above.
The embodiment of the invention also provides a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method provided by any of the above embodiments.
Fig. 5 illustrates an exemplary system architecture 500 of a speech recognition method or speech recognition device to which embodiments of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 is used as a medium to provide communication links between the terminal devices 501, 502, 503 and the server 505. The network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 505 via the network 504 using the terminal devices 501, 502, 503 to receive or send messages or the like. Various client applications may be installed on the terminal devices 501, 502, 503, such as a client containing one or more scenes such as live real-time captioning, real-time recording of a meeting, voice input, voice wakeup, etc.
The terminal devices 501, 502, 503 may be a variety of electronic devices having a display screen and supporting a variety of client applications, including but not limited to smartphones, tablet computers, laptop and desktop computers, and the like.
The server 505 may be a server providing various services, such as a background management server providing support for client applications used by the user with the terminal devices 501, 502, 503. The background management server can process the received voice data to be recognized and feed back recognition results after the voice data are recognized to the terminal equipment.
It should be noted that, the method of voice recognition provided by the embodiment of the present invention is generally performed by the server 505, and accordingly, the device for voice recognition is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, there is illustrated a schematic diagram of a computer system 600 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 6 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 601.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units involved in the embodiments of the present invention may be implemented in software, or may be implemented in hardware. The described modules and/or units may also be provided in a processor, e.g., may be described as: a processor comprises a voice data processing module, a voice data recognition module and a recognition result sending module; the names of these modules do not constitute a limitation on the module itself in some cases, and for example, a module that processes voice data may also be described as "a module that divides voice data into a plurality of voice data blocks".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to include: receiving voice data to be recognized in response to triggering voice recognition, and dividing the voice data into a plurality of voice data blocks, wherein the voice data blocks comprise a plurality of initial voice frames with continuous time sequences; for each of the blocks of speech data, sequentially performing: adding a plurality of padding speech frames to the speech data block; extracting a plurality of groups of voice frame groups from the voice data block according to a set step length, wherein the total number of initial voice frames and filling voice frames contained in each group of voice frame groups is a set number, and the set step length indicates the position difference between extraction positions of every two adjacent groups of voice frame groups; according to a time sequence, inputting an initial voice frame and a filling voice frame contained in each voice frame group into a preset voice recognition model for each voice frame group to obtain a voice recognition result of the voice frame group; the number of convolution kernels contained in the preset voice recognition model is the set number; obtaining a recognition result aiming at the voice data block according to the voice recognition result corresponding to each voice frame group; and determining a target recognition result of the voice data according to the recognition result of each recognized voice data block and sending the target recognition result.
According to the embodiments of the present invention, voice data received in a voice recognition scene can be divided into a plurality of voice data blocks; a plurality of filling voice frames are added to each voice data block; voice frame groups containing a set number of voice frames are extracted from the filled voice data blocks; and the voice frames in each input voice frame group are processed by a preset voice recognition model to obtain a recognition result, wherein the number of convolution kernels contained in the preset voice recognition model is consistent with the set number. The embodiments of the present invention thereby solve the problem of low voice recognition accuracy caused by the prior art's failure to consider the relevance between the voice information of the current frame and the voice frames after the current frame within a voice data block, and improve the voice recognition effect.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (11)

1. A method of speech recognition, comprising:
receiving voice data to be recognized in response to triggering of voice recognition, and dividing the voice data into a plurality of voice data blocks, wherein each voice data block comprises a plurality of initial voice frames with continuous time sequences;
for each of the voice data blocks, sequentially performing:
adding a plurality of filling voice frames to the voice data block;
extracting a plurality of voice frame groups from the voice data block according to a set step length, wherein the total number of initial voice frames and filling voice frames contained in each voice frame group is a set number, and the set step length indicates the offset between the extraction positions of every two adjacent voice frame groups;
inputting, for each voice frame group in time-sequence order, the initial voice frames and filling voice frames contained in the group into a preset voice recognition model to obtain a voice recognition result of the voice frame group, wherein the number of convolution kernels contained in the preset voice recognition model is the set number;
obtaining a recognition result for the voice data block according to the voice recognition result corresponding to each voice frame group;
and determining a target recognition result of the voice data according to the recognition result of each recognized voice data block and sending the target recognition result.
2. The method of claim 1, wherein said adding a plurality of filling voice frames to the voice data block comprises:
acquiring a time sequence of a plurality of initial voice frames contained in the voice data block;
taking the initial voice frame that is first in the time sequence as a first voice frame, and the initial voice frame that is last in the time sequence as a second voice frame; and
adding a plurality of filling voice frames before the first voice frame and after the second voice frame, respectively.
3. The method as recited in claim 1, further comprising:
and determining the set step length according to the number of initial voice frames and the number of filling voice frames contained in the voice data block, so that the number of initial voice frames contained in the voice data block is equal to the number of voice frame groups.
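A worked check of this relationship, under the same assumptions as the sketch above (symmetric padding and sliding-window extraction): with N initial frames, P filling frames on each side, group size K, and step length s, the group count is floor((N + 2P - K) / s) + 1, so choosing P = (K - 1) / 2 and s = 1 yields exactly N groups.

```python
def num_groups(n_initial: int, pad_each_side: int, group_size: int,
               stride: int) -> int:
    # Sliding-window count over the padded block.
    total = n_initial + 2 * pad_each_side
    return (total - group_size) // stride + 1

# With pad_each_side = (group_size - 1) // 2 and stride = 1,
# the group count equals the number of initial voice frames:
assert num_groups(40, pad_each_side=3, group_size=7, stride=1) == 40
```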
4. The method of claim 1, wherein
the voice frame groups collectively include all of the initial voice frames; and the target voice frame at a set position in each voice frame group is the main influencing factor for the preset voice recognition model, while the other voice frames outside the set position are associated influencing factors for the preset voice recognition model.
5. The method as recited in claim 4, further comprising:
recognizing, through the preset voice recognition model, the target voice frame at the set position in the voice frame group, and taking the voice recognition result of the voice frame group as the voice recognition result of that target voice frame.
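Continuing the assumed numbers from the sketches above (group size 7, step length 1, 3 filling frames per side), and taking the set position to be the center of the group, which is an assumption since the claim does not fix it, the target voice frame of group i is exactly initial frame i. Attributing each group's result to its target frame therefore yields one recognition result per initial frame:

```python
# Hypothetical check: group i spans padded[i : i + group_size]; its center
# sits at index i + (group_size - 1) // 2 of the padded block, which maps
# back to initial frame i once the left padding is subtracted.
group_size, pad = 7, 3
for i in range(40):
    center_in_padded = i + (group_size - 1) // 2
    assert center_in_padded - pad == i  # group i's target is initial frame i
```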
6. The method of claim 1, wherein
the preset voice recognition model comprises a preset encoder, wherein the preset encoder comprises the set number of convolution kernels; and
the inputting of the initial voice frames and filling voice frames contained in the voice frame group into the preset voice recognition model comprises:
inputting the initial voice frames and filling voice frames contained in the voice frame group into the preset encoder, performing a convolution operation on them using the set number of convolution kernels contained in the preset encoder, and outputting features of the voice frame group according to the result of the convolution operation.
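The claim ties the encoder's convolution-kernel count to the set number but does not spell out the architecture. One plausible reading, sketched below in PyTorch purely as an assumption: a one-dimensional convolution whose receptive field spans an entire voice frame group, so that each group is encoded into a single feature vector. The class name, dimensions, and activation are hypothetical.

```python
import torch
import torch.nn as nn

class PresetEncoder(nn.Module):
    """Hypothetical encoder: a 1-D convolution whose kernel spans an
    entire frame group, emitting one feature vector per group."""
    def __init__(self, feat_dim: int, hidden_dim: int, group_size: int):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden_dim, kernel_size=group_size)
        self.act = nn.ReLU()

    def forward(self, group: torch.Tensor) -> torch.Tensor:
        # group: (batch, group_size, feat_dim) -> (batch, feat_dim, group_size)
        x = group.transpose(1, 2)
        # Conv1d output length is 1 because the kernel covers the whole group.
        return self.act(self.conv(x)).squeeze(-1)  # (batch, hidden_dim)

encoder = PresetEncoder(feat_dim=100, hidden_dim=256, group_size=7)
features = encoder(torch.randn(8, 7, 100))  # -> (8, 256)
```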
7. The method of claim 6, wherein
the preset voice recognition model further comprises a time sequence classification model and an attention decoder; and
the inputting of the initial voice frames and filling voice frames contained in the voice frame group into the preset voice recognition model to obtain a voice recognition result of the voice frame group comprises:
inputting the features of the voice frame group into the time sequence classification model, and acquiring a first text feature corresponding to the voice frame group output by the time sequence classification model; and
inputting the features of the voice frame group and the first text feature of the voice frame group into the attention decoder to obtain the voice recognition result of the voice frame group.
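A time sequence classification model feeding an attention decoder is reminiscent of hybrid CTC/attention (connectionist temporal classification) architectures in end-to-end speech recognition. The sketch below is one hedged interpretation of the data flow in this claim: the CTC-style branch produces a first text feature, and the attention decoder consumes it together with the acoustic features. Every layer choice and dimension here is an assumption, not the patent's disclosed design.

```python
import torch
import torch.nn as nn

class HybridRecognizer(nn.Module):
    """Hypothetical head: a CTC-style branch yields a first text feature
    from the group features; an attention decoder then attends over the
    group features, conditioned on that text feature."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.ctc_proj = nn.Linear(hidden_dim, vocab_size)   # CTC-style branch
        self.text_embed = nn.Linear(vocab_size, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                          batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, group_feats: torch.Tensor):
        # group_feats: (batch, n_groups, hidden_dim)
        ctc_logits = self.ctc_proj(group_feats)              # first text feature
        query = self.text_embed(ctc_logits.softmax(dim=-1))
        context, _ = self.attn(query, group_feats, group_feats)
        return ctc_logits, self.out(context)                 # both branch outputs

model = HybridRecognizer(hidden_dim=256, vocab_size=5000)
ctc_logits, att_logits = model(torch.randn(8, 40, 256))
```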
8. The method as recited in claim 7, further comprising:
superposing the objective function of the time sequence classification model and the objective function of the attention decoder to obtain a model objective function of the preset voice recognition model;
training the preset voice recognition model;
and evaluating a training result of the preset voice recognition model using the model objective function, adjusting the time sequence classification model and/or the attention decoder according to the training result, and adjusting, within the model objective function, the weight of the objective function of the time sequence classification model or of the objective function of the attention decoder.
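This weighted superposition matches the common hybrid training objective L = w * L_ctc + (1 - w) * L_att. Below is a minimal sketch using standard PyTorch losses, assuming the attention decoder emits one logit vector per target token; the default weight and the tensor shapes are assumptions.

```python
import torch.nn.functional as F

def joint_loss(ctc_logits, att_logits, targets, input_lens, target_lens,
               weight: float = 0.3):
    """Weighted superposition of a CTC objective and an attention
    (cross-entropy) objective. `weight` is the tunable mixing factor."""
    # ctc_logits: (batch, n_groups, vocab) -> CTC wants (time, batch, vocab).
    log_probs = ctc_logits.log_softmax(dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens,
                     blank=0, zero_infinity=True)
    # att_logits: (batch, target_len, vocab), aligned with `targets`.
    att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                          targets.reshape(-1))
    return weight * ctc + (1.0 - weight) * att
```

Evaluating a validation loss under this combined objective and then nudging `weight` up or down would be one straightforward way to realize the weight adjustment described in the claim.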
9. An apparatus for voice recognition, comprising: a voice data processing module, a voice data recognition module, and a recognition result sending module; wherein,
the voice data processing module is configured to receive voice data to be recognized in response to triggering of voice recognition, and to divide the voice data into a plurality of voice data blocks, wherein each voice data block comprises a plurality of initial voice frames with continuous time sequences;
the voice data recognition module is configured to, for each of the voice data blocks, sequentially perform: adding a plurality of filling voice frames to the voice data block; extracting a plurality of voice frame groups from the voice data block according to a set step length, wherein the total number of initial voice frames and filling voice frames contained in each voice frame group is a set number, and the set step length indicates the offset between the extraction positions of every two adjacent voice frame groups; and inputting, for each voice frame group in time-sequence order, the initial voice frames and filling voice frames contained in the group into a preset voice recognition model to obtain a voice recognition result of the voice frame group, wherein the number of convolution kernels contained in the preset voice recognition model is the set number; and
the recognition result sending module is configured to obtain a recognition result for the voice data block according to the voice recognition result corresponding to each voice frame group, and to determine a target recognition result of the voice data according to the recognition results of the recognized voice data blocks and send the target recognition result.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-8.
CN202310300312.3A 2023-03-24 2023-03-24 Voice recognition method and device Pending CN116844534A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310300312.3A CN116844534A (en) 2023-03-24 2023-03-24 Voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310300312.3A CN116844534A (en) 2023-03-24 2023-03-24 Voice recognition method and device

Publications (1)

Publication Number Publication Date
CN116844534A 2023-10-03

Family

ID=88165862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310300312.3A Pending CN116844534A (en) 2023-03-24 2023-03-24 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN116844534A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination