CN113327603A - Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium

Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium

Info

Publication number
CN113327603A
CN113327603A (application CN202110637229.6A)
Authority
CN
China
Prior art keywords
voice
encoder
output
feature
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110637229.6A
Other languages
Chinese (zh)
Other versions
CN113327603B (en)
Inventor
刘柏基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202110637229.6A priority Critical patent/CN113327603B/en
Publication of CN113327603A publication Critical patent/CN113327603A/en
Application granted granted Critical
Publication of CN113327603B publication Critical patent/CN113327603B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/04: Segmentation; Word boundary detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention provides a voice recognition method, a voice recognition device, electronic equipment and a computer-readable storage medium, and relates to the field of voice recognition. The method comprises the following steps: obtaining a voice feature sequence corresponding to a voice signal to be recognized; partitioning the voice feature sequence to obtain a plurality of voice feature blocks; inputting the voice feature blocks into a pre-trained voice recognition model, and carrying out self-attention coding on the voice feature blocks by using an encoder of the voice recognition model to obtain output features of the encoder; carrying out truncation processing on the output features of the encoder by using a connection time sequence classification module of the voice recognition model to obtain a plurality of feature segments; and decoding each feature segment by using a decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized. In this way, real-time voice recognition is realized, high recognition accuracy can be guaranteed, and the fluency of voice interaction is improved, so that a better recognition effect is obtained in a streaming voice recognition scenario.

Description

Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
Speech recognition refers to a process of converting speech signals into corresponding texts through a computer, and is one of the main ways of realizing human-machine interaction.
Traditional speech recognition is mainly based on hidden Markov model-deep neural network (HMM-DNN) modeling. It is constrained by the modeling limitations of the hidden Markov model and by the many hand-crafted rules used by the decoder, such as pronunciation dictionaries and language models; these hand-crafted rules achieve good results when the amount of data is small, but prevent the modeling potential from being fully exploited when the amount of data is large. Compared with traditional speech recognition, end-to-end speech recognition directly models the mapping from the audio sequence to the character sequence; the modeling process is simpler than that of the traditional method, and because the hand-crafted rules are removed, the modeling potential can be fully exploited when the data volume is large.
However, in current attention-based end-to-end speech recognition models (e.g., the Transformer model), the attention mechanism depends on the feature input at all time steps during calculation; that is, recognizing the content of a segment of speech requires the entire segment to be input first, so real-time recognition cannot be performed.
Disclosure of Invention
In view of the above, the present invention provides a speech recognition method, an apparatus, an electronic device, and a computer-readable storage medium, which solve the problems in the prior art that an end-to-end speech recognition model requires complete speech input, cannot recognize in real time, and cannot achieve a good recognition effect in a streaming speech recognition scenario.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
in a first aspect, the present invention provides a speech recognition method, comprising:
acquiring a voice characteristic sequence corresponding to a voice signal to be recognized;
partitioning the voice feature sequence to obtain a plurality of voice feature blocks;
inputting the voice feature blocks into a pre-trained voice recognition model, and performing self-attention coding on the voice feature blocks by using a coder of the voice recognition model to obtain output features of the coder;
utilizing a connection time sequence classification module of the voice recognition model to carry out truncation processing on the output characteristics of the encoder to obtain a plurality of characteristic segments;
and decoding each characteristic segment by utilizing a decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized.
In an alternative embodiment, the encoder includes a plurality of attention layers, and the self-attention encoding the plurality of speech feature blocks by the encoder of the speech recognition model to obtain the output features of the encoder includes:
after each attention layer acquires input information, determining context information corresponding to each information block in the input information according to context configuration information corresponding to the attention layer, and performing self-attention calculation according to each information block and the context information corresponding to each information block to obtain an output result of the attention layer; wherein, the input information of a first attention layer in the plurality of attention layers is the plurality of voice feature blocks, and the input information of other attention layers except the first attention layer is the output result of the previous attention layer;
and taking the output result of the last attention layer in the plurality of attention layers as the output characteristic of the encoder.
In an optional embodiment, the context information corresponding to each information block includes a previous information block and/or a next information block adjacent to the information block.
In an optional embodiment, the performing, by using the connection timing classification module of the speech recognition model, the truncation processing on the output feature of the encoder to obtain a plurality of feature segments includes:
acquiring a probability vector generated by the connection time sequence classification module according to the output characteristics of the encoder;
determining a truncation point corresponding to each character according to the probability vector;
and truncating the output characteristic of the encoder into a plurality of characteristic segments according to the truncation point.
In an optional embodiment, the determining, according to the probability vector, a truncation point corresponding to each word includes:
acquiring a character output probability corresponding to the current moment and a character output probability corresponding to the previous moment of the current moment according to the probability vector;
and if the character output probability corresponding to the current moment represents that characters appear at the current moment, and the character output probability corresponding to the previous moment represents that no characters appear at the previous moment or that the characters appearing at the previous moment are different from the characters appearing at the current moment, determining the current moment as a truncation point.
In an alternative embodiment, the truncating the output feature of the encoder into a plurality of feature segments according to the truncation point includes:
determining a truncation window with a preset time length according to each truncation point;
and taking the output characteristic of the encoder corresponding to each truncation window as a characteristic segment, thereby truncating the output characteristic of the encoder into a plurality of characteristic segments.
In an optional embodiment, the decoding, by the decoder using the speech recognition model, each feature segment to obtain a recognition result corresponding to the speech signal to be recognized includes:
calculating the initial character output probability of the decoder at the current moment according to the feature segment corresponding to the current moment and all character output results of the decoder before the current moment;
calculating a character output result of the decoder at the current moment according to the initial character output probability of the decoder at the current moment and the character output probability of the connection time sequence classification module at the current moment;
and obtaining a recognition result corresponding to the voice signal to be recognized according to the character output result of the decoder at each moment.
In a second aspect, the present invention provides a speech recognition apparatus, comprising:
the characteristic extraction module is used for acquiring a voice characteristic sequence corresponding to a voice signal to be recognized;
the feature processing module is used for partitioning the voice feature sequence to obtain a plurality of voice feature blocks;
a recognition result determining module, configured to input the multiple speech feature blocks into a pre-trained speech recognition model, and perform self-attention coding on the multiple speech feature blocks by using an encoder of the speech recognition model to obtain output features of the encoder; utilizing a connection time sequence classification module of the voice recognition model to carry out truncation processing on the output characteristics of the encoder to obtain a plurality of characteristic segments; and decoding each characteristic segment by utilizing a decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized.
In a third aspect, the present invention provides an electronic device comprising a processor and a memory, wherein the memory stores a computer program, and the processor implements the method of any one of the preceding embodiments when executing the computer program.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the preceding embodiments.
The embodiment of the invention provides a voice recognition method, a voice recognition device, electronic equipment and a computer readable storage medium, wherein the method comprises the following steps: obtaining a voice feature sequence corresponding to a voice signal to be recognized; partitioning the voice feature sequence to obtain a plurality of voice feature blocks; inputting the voice feature blocks into a pre-trained voice recognition model, and carrying out self-attention coding on the voice feature blocks by using an encoder of the voice recognition model to obtain output features of the encoder; carrying out truncation processing on the output features of the encoder by using a connection time sequence classification module of the voice recognition model to obtain a plurality of feature segments; and decoding each feature segment by using a decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized. By partitioning the voice feature sequence sent to the attention mechanism into blocks, the embodiment of the invention enables the encoder to carry out self-attention calculation based on the partitioned voice feature blocks, which effectively breaks the global dependency of the self-attention mechanism; combined with the truncation processing of the connection time sequence classification module, the decoder can decode each feature segment obtained by truncation. As a result, real-time voice recognition is realized, high recognition accuracy is ensured, and the fluency of voice interaction is improved, so that a better recognition effect is obtained in a streaming voice recognition scene.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a block diagram of an electronic device provided by an embodiment of the invention;
FIG. 2 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 3 illustrates an architectural diagram of a speech recognition model;
FIG. 4 is a flow chart illustrating another exemplary speech recognition method according to an embodiment of the present invention;
FIG. 5 shows a schematic diagram of a structure of an encoder;
FIG. 6 shows a schematic diagram of context information in an attention layer;
FIG. 7 is a flow chart illustrating a speech recognition method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram showing the connection timing classification module truncating the output features of the encoder;
FIG. 9 is a flow chart illustrating a speech recognition method according to an embodiment of the present invention;
FIG. 10 shows a data flow diagram when the attention layer considers context information;
FIG. 11 shows a data flow diagram for a portion of the attention layer without regard to contextual information;
fig. 12 is a functional block diagram of a speech recognition apparatus according to an embodiment of the present invention.
Icon: 100-an electronic device; 110-a memory; 120-a processor; 130-a communication module; 400-a speech recognition device; 410-a feature extraction module; 420-feature processing module; 430-recognition result determination module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The voice recognition method and the voice recognition device provided by the embodiment of the invention can be applied to an end-to-end voice recognition scene, and can effectively reduce the waiting time of a user and improve the fluency of voice interaction, thereby obtaining a better voice recognition effect, especially under the streaming voice recognition scenes of real-time subtitles, virtual human real-time chatting, intelligent customer service, live broadcast content real-time analysis, commercial recommendation and the like.
Fig. 1 is a block diagram of an electronic device 100 according to an embodiment of the invention. The electronic device 100 includes a memory 110, a processor 120, and a communication module 130. The memory 110, the processor 120, and the communication module 130 are electrically connected to each other directly or indirectly to enable data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The memory 110 is used to store programs or data. The Memory 110 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 120 is used to read/write data or programs stored in the memory 110 and perform corresponding functions. For example, the speech recognition method disclosed by the embodiment of the present invention can be implemented when the processor 120 executes the computer program stored in the memory 110.
The communication module 130 is used for establishing a communication connection between the electronic device 100 and another communication terminal through a network, and for transceiving data through the network.
In the present embodiment, the electronic device 100 may be, but is not limited to, a server, a PC (Personal Computer), a smart phone, a tablet Computer, a smart wearable device (such as a smart watch and smart glasses), a navigation device, a multimedia player device, an education device, a game device, a smart speaker, and the like.
It should be understood that the configuration shown in fig. 1 is merely a schematic diagram of the configuration of the electronic device 100, and that the electronic device 100 may include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, and the computer program can implement the speech recognition method disclosed in the embodiments of the present invention when executed by the processor 120.
Fig. 2 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention. It should be noted that the speech recognition method provided by the embodiment of the present invention is not limited by fig. 2 and the following specific sequence, and it should be understood that, in other embodiments, the sequence of some steps in the speech recognition method provided by the embodiment of the present invention may be interchanged according to actual needs, or some steps in the speech recognition method may be omitted or deleted. The speech recognition method can be applied to the electronic device 100 shown in fig. 1, and the specific flow shown in fig. 2 will be described in detail below.
Step S201, a speech feature sequence corresponding to the speech signal to be recognized is obtained.
In this embodiment, the speech signal to be recognized may be speech data input by a user. For example, when a user wants to interact with the electronic device 100 in a voice manner, the user can directly speak into a voice collecting device on the electronic device 100, the electronic device 100 obtains the spoken words of the user through the voice collecting device as a voice signal to be recognized, and a corresponding voice feature sequence can be obtained by performing feature extraction on the voice signal to be recognized.
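For illustration only, the following is a minimal sketch of such feature extraction, assuming torchaudio and 80-dimensional log-Mel filterbank (fbank) features; the patent does not specify the feature type, the library, or any of the parameters used here, so all of them are assumptions.

```python
# Hedged sketch (not from the patent): 80-dim log-Mel (fbank) feature extraction
# with torchaudio; the patent does not specify the feature type or any library.
import torchaudio

def extract_feature_sequence(wav_path: str):
    waveform, sample_rate = torchaudio.load(wav_path)
    # One 80-dim feature vector roughly every 10 ms, a common choice for speech recognition.
    features = torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=80,
        frame_shift=10.0,
        sample_frequency=sample_rate,
    )
    return features  # shape: (num_frames, 80), i.e. the speech feature sequence
```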
Step S202, the voice feature sequence is partitioned to obtain a plurality of voice feature blocks.
In this embodiment, the speech feature sequence is usually a relatively long sequence, and when the speech feature sequence is partitioned, the speech feature sequence may be partitioned into relatively small blocks according to a set time length, so as to obtain a plurality of speech feature blocks.
For example, after the speech feature sequence [0, T] is cut into small blocks, the obtained speech feature blocks can be represented as [0, t1], [t1, t2], ..., [tn, T], respectively.
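As a non-authoritative sketch of this blocking step, assuming a fixed block length measured in frames (the patent only requires a set time length, and the value of 40 frames is an assumption):

```python
# Hedged sketch: cut the speech feature sequence [0, T] into fixed-length blocks.
# A block length of 40 frames is an assumed value, not taken from the patent.
def split_into_blocks(features, block_len: int = 40):
    # features: (num_frames, feature_dim); the last block may be shorter than block_len.
    return [features[t:t + block_len] for t in range(0, len(features), block_len)]
```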
Step S203, inputting the plurality of voice feature blocks into a pre-trained voice recognition model, and performing self-attention coding on the plurality of voice feature blocks by using a coder of the voice recognition model to obtain the output features of the coder.
In this embodiment, the self-attention coding refers to a process in which an encoder performs self-attention calculation on a plurality of input speech feature blocks to finally obtain an output feature. The electronic device 100 stores a pre-trained speech recognition model, where the speech recognition model refers to a model that has been trained to be mature, the speech recognition model may be any model that can perform speech recognition, in this embodiment, a streaming end-to-end speech recognition model that can implement real-time speech recognition is adopted, and fig. 3 exemplarily shows a schematic structural diagram of the speech recognition model.
As shown in fig. 3, the speech recognition model mainly includes an encoder, a CTC (Connectionist Temporal Classification) module, and a decoder. The streaming attention mechanism in the encoder is realized mainly by performing block-wise calculation on the input features, which is equivalent to cutting the source audio into many small segments and processing them separately; the streaming attention mechanism of the encoder-decoder is realized mainly by truncating the output features of the encoder through the CTC module so as to ensure incremental calculation. The attention mechanism in the encoder is different from that of the encoder-decoder: the input and output lengths of the attention mechanism in the encoder are in one-to-one correspondence.
Specifically, after the speech feature sequence is partitioned into a plurality of speech feature blocks and the speech feature blocks are input into the speech recognition model, the encoder of the speech recognition model performs self-attention coding on the speech feature blocks based on its own attention mechanism. The self-attention is calculated in increments of blocks, and the obtained attention result determines how much attention is put on other related speech feature blocks when a given speech feature block is coded; finally, the output feature of the encoder is obtained and used as the input of the CTC module. In this way, by partitioning the information sent to the attention mechanism, the attention information can be calculated incrementally block by block, achieving the streaming effect.
And step S204, utilizing a connection time sequence classification module of the voice recognition model to intercept the output characteristics of the encoder to obtain a plurality of characteristic segments.
In this embodiment, a CTC-based truncation module (i.e., the connection timing classification module in this embodiment) can be obtained by jointly training the CTC and Transformer models. The connection timing classification module obtains, by performing CTC processing on the output features of the encoder, truncation information for truncating the output features of the encoder, and then truncates the output features of the encoder into a plurality of feature segments based on the truncation information.
And S205, decoding each feature segment by using a decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized.
In this embodiment, each feature segment output by the connection timing classification module is sent to a decoder for decoding, and finally, a recognition result corresponding to the speech signal to be recognized is obtained according to a decoding result of each feature segment by the decoder.
It can be seen that the speech recognition method provided in the embodiment of the present invention obtains the speech feature sequence corresponding to the speech signal to be recognized, partitions the speech feature sequence to obtain a plurality of speech feature blocks, inputs the plurality of speech feature blocks into the pre-trained speech recognition model, performs self-attention coding on the plurality of speech feature blocks by using the encoder of the speech recognition model to obtain the output features of the encoder, truncates the output features of the encoder by using the connection timing classification module of the speech recognition model to obtain a plurality of feature segments, and decodes each feature segment by using the decoder of the speech recognition model to obtain the recognition result corresponding to the speech signal to be recognized. By partitioning the speech feature sequence sent to the attention mechanism into blocks, the embodiment of the invention enables the encoder to perform self-attention calculation based on the partitioned speech feature blocks, which effectively breaks the global dependency of the self-attention mechanism; combined with the truncation processing of the connection timing classification module, the decoder can decode each feature segment obtained by truncation. As a result, real-time speech recognition is realized, high recognition accuracy is ensured, and the fluency of voice interaction is improved, so that a better recognition effect is obtained in a streaming speech recognition scenario.
In this embodiment, the internal structures of the encoder and the decoder described above may be implemented by using a transform model, but in other implementations, the encoder and the decoder may also have other structures.
In an embodiment, the encoder may include a plurality of attention layers, and the specific number of attention layers may be configured according to actual needs, which is not limited in the embodiment of the present invention; more useful information can be extracted from the input information of the encoder through the plurality of attention layers, thereby obtaining the output features of the encoder. As shown in fig. 5, taking an encoder implemented with a Transformer model as an example, the encoder may be composed of N (e.g., N = 6) attention layers based on the self-attention mechanism, and each attention layer in the encoder may be understood as a Transformer encoder sublayer with a self-attention module; each attention layer includes a self-attention module and a feedforward neural network, and may include other modules. In practical application, the number of modules included in each attention layer can be increased or decreased as needed. Referring to fig. 4, the step S203 may include the following sub-steps:
step S2031, after each attention layer obtains the input information, determining the context information corresponding to each information block in the input information according to the context configuration information corresponding to the attention layer, and performing self-attention calculation according to each information block and the context information corresponding to each information block to obtain the output result of the attention layer; the input information of the first attention layer in the plurality of attention layers is a plurality of voice feature blocks, and the input information of the other attention layers except the first attention layer is the output result of the previous attention layer.
In sub-step S2032, the output result of the last attention layer of the plurality of attention layers is used as the output characteristic of the encoder.
In this embodiment, the context information corresponding to each information block may include a previous information block and/or a next information block adjacent to the information block, that is, the context information of the information block may be considered when performing the self-attention calculation of the information block in each attention layer.
For each attention layer, corresponding context configuration information is configured in advance, and the context configuration information can be used for judging whether the context information is considered when the attention layer performs self-attention calculation, and what content the context information corresponding to an information block includes. In this embodiment, the context information includes the previous information block and the next information block of the current information block.
In this embodiment, after a plurality of speech feature blocks are input to an encoder of a speech recognition model as input information, a first attention layer in the encoder performs self-attention calculation according to each information block (i.e., speech feature block) and context information corresponding to each information block, so as to obtain an output result of the first attention layer. The output result of the first attention layer is used as the input information of a second attention layer (namely, the attention layer behind the first attention layer), the input information of the second attention layer also comprises a plurality of information blocks, the second attention layer performs self-attention calculation on the information blocks in the same way, and the finally obtained output result is used as the input information of the next attention layer and is analogized in turn until the last attention layer completes the processing of the input information to obtain the corresponding output result. Finally, the output result of the last attention layer is taken as the output characteristic of the encoder.
The attention layer in this embodiment is a multi-head attention layer, which means that the attention layer calculates a plurality of groups of attentions, and each group of attentions focuses on different parts of the input information to extract information from different angles. For example, assuming that the feature dimension of the input information is 256 and the attention layer includes 4 attention heads, each attention layer actually calculates 4 sets of attention results, each set of output dimension is 64, and finally the 4 sets of 64-dimensional attention results are spliced (concat) together to be 256-dimensional to obtain the final output result.
In addition, the output result of each attention layer at any time can be regarded as one information block. When the output results at different times are input to the next attention layer, the next attention layer concatenates the information block at the previous time (assumed to be [t6, t7]), the information block at the current time (assumed to be [t7, t8]), and the information block at the next time (assumed to be [t8, t9]) to obtain a data segment (i.e., [t6, t9]) including the current information block and the context information corresponding to the current information block, and then performs the self-attention calculation based on the concatenated data segment. It should be noted that, when concatenating data segments, the information blocks at the initial time and the end time have no corresponding information block at the preceding and following time, respectively; an information block whose data are all 0 may therefore be supplemented as the previous information block or the next information block. The format and size of the supplementary information block are the same as those of a regular information block, but the data in the supplementary information block are all 0, whereas the data in a regular information block are actual non-zero values.
As shown in fig. 6, current chunk is the current information block, and left context and right context are the context information corresponding to the current information block. For each information block, a query vector Query (Q), a key vector Key (K), and a value vector Value (V) may be created. For the query vector of the current information block, the similarity or correlation between the query and each key is calculated to obtain a weight coefficient for the value corresponding to each key; after normalization by a softmax function, the weight coefficients are used to compute a weighted sum of the corresponding values, which gives the attention result. The attention calculation formula can be expressed as:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

The final output result of the attention layer may be denoted as output = MultiHeadAttention(query, key, value). In this embodiment, key and value are equal, and their content includes the current information block and the context information corresponding to the current information block; d_k is the dimension of the key vector Key.
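The following is a minimal sketch of one such block-wise self-attention step, assuming PyTorch and the 256-dimensional, 4-head example above; the function and variable names are illustrative and this is not the patent's implementation.

```python
# Hedged sketch: one block-wise self-attention step over the current block plus its
# left/right context; missing neighbours at the sequence boundaries are zero-filled.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

def block_self_attention(prev_blk, cur_blk, next_blk):
    # cur_blk: (batch, block_len, 256); prev_blk / next_blk: same shape, or None at the
    # sequence boundaries, in which case an all-zero block is supplemented as described above.
    if prev_blk is None:
        prev_blk = torch.zeros_like(cur_blk)
    if next_blk is None:
        next_blk = torch.zeros_like(cur_blk)
    # key = value = [left context, current block, right context], e.g. [t6, t9].
    key_value = torch.cat([prev_blk, cur_blk, next_blk], dim=1)
    out, _ = attn(query=cur_blk, key=key_value, value=key_value)
    return out  # same length as the current block (one-to-one input/output lengths)
```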
Therefore, the speech recognition method provided by the embodiment of the invention partitions the information sent to the attention mechanism into blocks and takes the context information of each information block into consideration in the attention calculation of each attention layer, so that the streaming effect can be achieved without depending on the feature input at all times. Moreover, the context configuration information corresponding to each attention layer is used to decide whether the context information is considered and what content the context information corresponding to an information block includes, so that the delay of the speech recognition model can be effectively controlled and the voice interaction experience of the user is improved.
In one embodiment, referring to fig. 7, the step S204 may include the following sub-steps:
and a substep S2041 of obtaining a probability vector generated by the connection timing sequence classification module according to the output characteristics of the encoder.
And a substep S2042 of determining a truncation point corresponding to each character according to the probability vector.
And a substep S2043 of truncating the output features of the encoder into a plurality of feature segments according to the truncation point.
In this embodiment, after receiving the output features of the encoder, the connection timing classification module may calculate the output features of the encoder by using a CTC algorithm, so as to generate a corresponding probability vector. The CTC algorithm is used to solve the classification problem of time series data, and its input is usually the feature frame of speech and its output is the character classification of the corresponding frame.
It can be understood that the probability vector may include probabilities of different classes, and a probability peak (i.e., the truncation information mentioned above) is generated at a time when a character appears, that is, at that time the probability of the corresponding character is very high; at times when no character appears, the probability of the blank character is large. Therefore, the truncation point corresponding to each character can be determined by using the probability peak information in the probability vector, and the output features of the encoder are truncated into a plurality of feature segments according to the truncation points.
Optionally, the connection timing classification module may determine the truncation point corresponding to each word according to the probability vector in the following manner, that is, the sub-step S2042 includes: acquiring a character output probability corresponding to the current moment and a character output probability corresponding to the previous moment of the current moment according to the probability vector; and if the character output probability corresponding to the current moment represents that characters appear at the current moment, and the character output probability corresponding to the previous moment represents that no characters appear at the previous moment or that the characters appearing at the previous moment are different from the characters appearing at the current moment, determining the current moment as the truncation point.
In this embodiment, the text output probability corresponding to the current time may be understood as a maximum value among different types of probabilities generated by the connection timing classification module at the current time, where the different types of probabilities include a probability that no text appears (blank characters) and a probability that various types of text may appear; similarly, the character output probability corresponding to the previous moment is the maximum value of the probabilities of different classes generated by the connection time sequence classification module at the previous moment. Therefore, whether the current time corresponds to a blank character or a character can be judged based on the character output probability corresponding to the current time, and if the current time corresponds to the blank character or the character, which character is most likely to appear is judged; based on the output probability of the characters corresponding to the previous moment, whether the characters correspond to blank characters or characters at the previous moment can be determined, and if the characters correspond to the blank characters or the characters, which characters are most likely to appear.
Assuming that the current time is time T, it is judged, according to the character output probability corresponding to time T, whether a character appears at time T; it is also judged whether time T-1 (i.e., the time immediately preceding the current time) corresponds to a blank character, or whether the character corresponding to time T-1 is different from the character corresponding to time T. If both conditions hold, the character output probability corresponding to time T is a probability peak, and time T can be determined as the truncation point of the corresponding character.
Optionally, the connection timing classification module may truncate the output feature of the encoder into a plurality of feature segments according to the truncation point, where the sub-step S2043 includes: determining a truncation window with a preset time length according to each truncation point; and taking the output characteristic of the encoder corresponding to each truncation window as a characteristic segment, thereby truncating the output characteristic of the encoder into a plurality of characteristic segments.
For example, the length of the truncation window is preset in the connection timing classification module as a preset time length. When time T is determined as the truncation point of the corresponding character, a window of the preset time length containing the truncation point is selected as the truncation window, and the output features of the encoder corresponding to the truncation window are taken as the truncated feature segment. That is, after the truncation point is determined, the output features of the encoder within a certain window before and after the truncation point are used as the truncated context information (feature segment); the output features of the encoder can thus be truncated into a plurality of feature segments, and the truncated feature segments are then sent to the decoder for decoding. As shown in fig. 8, after the speech feature blocks are processed by the encoder, the output features of the encoder are sent to the connection timing classification module for truncation; after the connection timing classification module determines the truncation point, the output features of the encoder within a certain window before and after the truncation point (shown by the dashed box in fig. 8) are truncated as a feature segment and sent to the decoder for decoding.
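A minimal sketch of this truncation logic follows, assuming per-frame CTC posteriors with the blank symbol at index 0 and an assumed window of half_window frames on each side of a truncation point (the patent leaves the window length as a preset value, so these numbers and names are illustrative).

```python
# Hedged sketch: determine truncation points from per-frame CTC posteriors and cut
# the encoder output into feature segments; blank_id and half_window are assumptions.
import torch

def ctc_truncate(encoder_out, ctc_posteriors, blank_id: int = 0, half_window: int = 20):
    # encoder_out: (num_frames, dim); ctc_posteriors: (num_frames, vocab_size)
    best = ctc_posteriors.argmax(dim=-1)  # most probable symbol at each time step
    segments, prev = [], blank_id
    for t, sym in enumerate(best.tolist()):
        # Truncation point: a character appears at time t, and the previous time step
        # was blank or emitted a different character.
        if sym != blank_id and sym != prev:
            lo = max(0, t - half_window)
            hi = min(encoder_out.shape[0], t + half_window)
            segments.append(encoder_out[lo:hi])  # feature segment sent to the decoder
        prev = sym
    return segments
```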
It should be noted that, in the application process of the speech recognition model, the information window sent to the decoder may be determined based on the truncation mechanism in the connection timing classification module; in the training process of the speech recognition model, however, all encoder information is sent to the decoder, so training and decoding are inconsistent and the performance of the speech recognition model suffers. To solve this problem, in the embodiment of the present invention, when training the speech recognition model, the output features of the encoder are truncated using the alignment information generated from the CTC probabilities, and the truncated output features are then sent to the decoder for model training. The alignment information refers to the correspondence between text and audio, i.e., one character may correspond to a certain range of the audio.
Therefore, according to the speech recognition method provided by the embodiment of the invention, the probability vector generated by the connection time sequence classification module according to the output characteristics of the encoder is obtained, the truncation point corresponding to each character is determined according to the probability peak information in the probability vector, and the output characteristics of the encoder in a certain window before and after the truncation point are used as the truncated context information, so that the output characteristics of the encoder are truncated into a plurality of characteristic segments.
In one embodiment, referring to fig. 9, the step S205 may include the following sub-steps:
and a substep S2051, calculating the initial character output probability of the decoder at the current moment according to the feature segment corresponding to the current moment and all character output results of the decoder before the current moment.
In this embodiment, the feature segment (i.e., the truncated context information) corresponding to the current time and all the text output results of the decoder before the current time may be used as the input of the decoder, and the decoder calculates the initial text output probability of the current time.
And a substep S2052 of calculating a text output result of the decoder at the current moment according to the initial text output probability of the decoder at the current moment and the text output probability of the connection timing classification module at the current moment.
In this embodiment, joint decoding may be performed using the initial text output probability output by the decoder and the text output probability output by the connection timing classification module to obtain the actual text output probability (i.e., the joint decoding result) of the decoder at the current time, and then the text output result of the decoder at the current time is obtained according to the joint decoding result.
In one example, if the truncation point corresponding to the i-th character is t, the streaming CTC joint decoding formula adopted in the decoder can be expressed as:

log P(Y_i) = log Decoder(Y_{1:i-1}, X_truncated(t)) + γ · log CTC(X_t)

where Y_i denotes the i-th character output (i.e., the character output result at time t); Y_{1:i-1} denotes the character outputs from the 1st to the (i-1)-th (i.e., all character output results before time t); X_truncated(t) denotes the context information obtained by truncating the output feature of the encoder at time t (i.e., the feature segment corresponding to time t); Decoder(Y_{1:i-1}, X_truncated(t)) denotes inputting Y_{1:i-1} and X_truncated(t) into the decoder; log Decoder(Y_{1:i-1}, X_truncated(t)) denotes the initial character output probability of the decoder at time t; log CTC(X_t) denotes the character output probability of the CTC at time t (the probability of the truncation point); and γ denotes a weight coefficient that can be used for tuning. Using this formula, the actual character output probability log P(Y_i) of the decoder at time t can be obtained, and the character output result Y_i of the decoder at time t is then obtained from log P(Y_i).
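A minimal sketch of this joint scoring step is given below as a literal rendering of the formula above, assuming the decoder's initial character log-probabilities and the CTC log-probability of the truncation point are already available, and with an assumed weight γ = 0.3.

```python
# Hedged sketch: joint scoring of decoder and CTC log-probabilities; gamma = 0.3
# is an assumed tuning value, and the inputs are assumed to come from the model.
import torch

def joint_decode_step(decoder_log_probs: torch.Tensor, ctc_log_prob_at_t: float,
                      gamma: float = 0.3):
    # decoder_log_probs: (vocab_size,), the decoder's initial character output
    # log-probabilities at the current time; ctc_log_prob_at_t: log CTC(X_t), the
    # log-probability of the truncation point.
    joint_log_probs = decoder_log_probs + gamma * ctc_log_prob_at_t  # log P(Y_i) per candidate
    best_char = int(torch.argmax(joint_log_probs))  # character output result at this time
    return best_char, joint_log_probs
```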
And a substep S2053 of obtaining a recognition result corresponding to the voice signal to be recognized according to the character output result of the decoder at each moment.
In this embodiment, each feature segment obtained by truncation is sequentially sent to a decoder for decoding by connecting a time sequence classification module, so that a text output result at a corresponding moment can be obtained, and a recognition result corresponding to the voice signal to be recognized is obtained.
It can be seen that, in the speech recognition method provided in the embodiment of the present invention, because the truncation mechanism in the connection timing classification module does not allow a prefix tree search algorithm to be used, and because the prefix tree search algorithm is itself computationally expensive, joint decoding is performed using the CTC probability corresponding to the truncation point and the initial character output probability output by the decoder, thereby obtaining the character output result at the corresponding time.
It should be noted that, in practical application, whether each attention layer in the encoder considers context information when performing self-attention calculation, and whether the context information includes only the previous information block, only the next information block, or both, may be configured accordingly, so that each attention layer has corresponding context configuration information. Part of the attention layers can be controlled by the context configuration information not to expand future information, thereby achieving the effect of compressing the delay of the speech recognition model.
For example, when all attention layers are configured to consider context information when performing self-attention calculation, and the context information consists of the previous information block and the next information block, the data flow on each attention layer in the encoder may refer to fig. 10. In this case the delay of the whole speech recognition is calculated as: delay = block width × (1 + number of attention layers) + length of truncated context information. In order to further compress the delay, part of the attention layers can be controlled not to expand future information; the data flow on each attention layer in the encoder may then refer to fig. 11, where the solid box part in fig. 11 indicates that future information is not expanded, and the delay of the whole speech recognition is calculated as: delay = block width × (1 + number of layers that expand context) + length of truncated context information.
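As a worked example with assumed numbers (not taken from the patent): with a block width of 400 ms, 6 attention layers, and 200 ms of truncated context information, the first formula gives a delay of 400 × (1 + 6) + 200 = 3000 ms; if only 2 layers are allowed to expand context, the second formula gives 400 × (1 + 2) + 200 = 1400 ms, illustrating how restricting future context in part of the layers compresses the delay.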
In summary, in the speech recognition method provided by the embodiment of the present invention, a set of closely-linked systems is formed by organically coupling the CTC alignment pre-training, the CTC truncation mechanism, the CTC joint decoding mechanism, the attention mechanism based on blocking in the encoder, and the like, so that the decoding requirements can be met by one-time training, the model construction process is greatly simplified, the modeling performance of the streaming end-to-end speech recognition model can be significantly improved, the modeling potential of the streaming end-to-end speech recognition model can be fully exerted, and the performance of the streaming speech recognition application scenario can be improved. In the actual voice recognition process, under the same delay, the method has better calculation speed and recognition accuracy.
In order to perform the corresponding steps in the above embodiments and various possible manners, an implementation manner of the speech recognition apparatus is given below. Fig. 12 is a functional block diagram of a speech recognition apparatus 400 according to an embodiment of the present invention. It should be noted that the basic principle and the generated technical effect of the speech recognition apparatus 400 provided in the present embodiment are the same as those of the above embodiments, and for the sake of brief description, no part of the present embodiment is mentioned, and corresponding contents in the above embodiments may be referred to. The speech recognition apparatus 400 includes a feature extraction module 410, a feature processing module 420, and a recognition result determination module 430.
Alternatively, the modules may be stored in the memory 110 shown in fig. 1 in the form of software or Firmware (Firmware) or be fixed in an Operating System (OS) of the electronic device 100, and may be executed by the processor 120 in fig. 1. Meanwhile, data, codes of programs, and the like required to execute the above-described modules may be stored in the memory 110.
The feature extraction module 410 is configured to obtain a speech feature sequence corresponding to a speech signal to be recognized.
It is understood that the feature extraction module 410 may perform the step S201.
The feature processing module 420 is configured to block the speech feature sequence to obtain a plurality of speech feature blocks.
It is understood that the feature processing module 420 may perform the above step S202.
The recognition result determining module 430 is configured to input the plurality of speech feature blocks into a pre-trained speech recognition model, and perform self-attention coding on the plurality of speech feature blocks by using an encoder of the speech recognition model to obtain output features of the encoder; utilizing a connection time sequence classification module of a voice recognition model to carry out truncation processing on the output characteristics of the encoder to obtain a plurality of characteristic segments; and decoding each characteristic segment by using a decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized.
It is understood that the recognition result determining module 430 may perform the above-mentioned steps S203 to S205.
Optionally, the encoder includes multiple attention layers, and the recognition result determining module 430 may be specifically configured to determine, after each attention layer acquires the input information, context information corresponding to each information block in the input information according to context configuration information corresponding to the attention layer, and perform self-attention calculation according to each information block and the context information corresponding to each information block, so as to obtain an output result of the attention layer; the input information of a first attention layer in the plurality of attention layers is a plurality of voice feature blocks, and the input information of other attention layers except the first attention layer is the output result of the previous attention layer; and taking the output result of the last attention layer in the plurality of attention layers as the output characteristic of the encoder.
Wherein the context information corresponding to each information block comprises a previous information block and/or a next information block adjacent to the information block.
It is understood that the recognition result determining module 430 may perform the above-described sub-step S2031 to sub-step S2032.
Optionally, the identification result determining module 430 may be further configured to obtain a probability vector generated by the connection timing classification module according to the output feature of the encoder; determining a truncation point corresponding to each character according to the probability vector; the output features of the encoder are truncated into feature segments according to a truncation point.
The recognition result determining module 430 is configured to obtain a text output probability corresponding to a current time and a text output probability corresponding to a previous time of the current time according to the probability vector; and if the character output probability corresponding to the current moment represents that characters appear at the current moment, and the character output probability corresponding to the previous moment represents that no characters appear at the previous moment or that the characters appearing at the previous moment are different from the characters appearing at the current moment, determining the current moment as the truncation point. Determining a truncation window with a preset time length according to each truncation point; and taking the output characteristic of the encoder corresponding to each truncation window as a characteristic segment, thereby truncating the output characteristic of the encoder into a plurality of characteristic segments.
It is to be understood that the recognition result determining module 430 may perform the sub-steps S2041 to S2043 described above.
Optionally, the recognition result determining module 430 may be further configured to calculate an initial character output probability of the decoder at the current moment according to the feature segment corresponding to the current moment and all character output results of the decoder before the current moment; calculate a character output result of the decoder at the current moment according to the initial character output probability of the decoder at the current moment and the character output probability of the connection time sequence classification module at the current moment; and obtain a recognition result corresponding to the speech signal to be recognized according to the character output results of the decoder at the respective moments.
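The combination of the decoder's initial character output probability with the character output probability of the connection time sequence classification module at the same moment could, for example, be realized as a log-linear interpolation, as is common in hybrid CTC/attention decoding; the interpolation form and the 0.3 weight below are assumptions, not values taken from the patent.

```python
import numpy as np

def joint_step(decoder_logits, ctc_probs_t, ctc_weight=0.3):
    # decoder_logits: (vocab,) unnormalized scores from the decoder for the
    # current moment; ctc_probs_t: (vocab,) probabilities from the CTC-style
    # module at the same moment. Returns the chosen character index and the
    # combined log-probabilities.
    att_logp = decoder_logits - np.logaddexp.reduce(decoder_logits)  # log-softmax
    ctc_logp = np.log(ctc_probs_t + 1e-10)
    joint = (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp
    return int(joint.argmax()), joint
```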
It is to be understood that the recognition result determining module 430 may perform the above-described sub-steps S2051 to S2053.
It can be seen that, in the speech recognition apparatus 400 provided by the embodiment of the present invention, the feature extraction module 410 acquires a speech feature sequence corresponding to a speech signal to be recognized, the feature processing module 420 partitions the speech feature sequence to obtain a plurality of speech feature blocks, and the recognition result determining module 430 inputs the plurality of speech feature blocks into a pre-trained speech recognition model, performs self-attention coding on the plurality of speech feature blocks by using an encoder of the speech recognition model to obtain output features of the encoder, performs truncation processing on the output features of the encoder by using a connection time sequence classification module of the speech recognition model to obtain a plurality of feature segments, and decodes each feature segment by using a decoder of the speech recognition model to obtain a recognition result corresponding to the speech signal to be recognized. Because the embodiment of the present invention partitions the speech feature sequence fed to the attention mechanism, the encoder performs self-attention calculation on the partitioned speech feature blocks, which effectively breaks the global dependency of the self-attention mechanism; combined with the truncation processing of the connection time sequence classification module, the decoder can decode each feature segment obtained by truncation. This not only realizes real-time speech recognition, but also ensures a high recognition accuracy and improves the fluency of speech interaction, thereby achieving a better recognition effect in a streaming speech recognition scenario.
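Putting these pieces together, a hypothetical end-to-end pass over a short utterance, reusing the sketch functions above with random data and illustrative shapes and hyper-parameters, might look like the following.

```python
import numpy as np

np.random.seed(0)
blocks = [np.random.randn(8, 16) for _ in range(5)]          # 5 blocks x 8 frames x 16 dims
enc_blocks = encoder(blocks, layer_configs=[(1, 0), (1, 0)])  # 2 attention layers, left context only
enc_out = np.concatenate(enc_blocks, axis=0)                  # (T, d) output features of the encoder

vocab = 50
ctc_probs = np.abs(np.random.randn(enc_out.shape[0], vocab))
ctc_probs /= ctc_probs.sum(axis=-1, keepdims=True)            # stand-in for the CTC probability vector

points = find_truncation_points(ctc_probs)
segments = truncate_encoder_output(enc_out, points, window=10)
print(f"{len(points)} truncation points -> {len(segments)} feature segments")
```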
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring a voice feature sequence corresponding to a voice signal to be recognized;
partitioning the voice feature sequence to obtain a plurality of voice feature blocks;
inputting the plurality of voice feature blocks into a pre-trained voice recognition model, and performing self-attention coding on the plurality of voice feature blocks by using an encoder of the voice recognition model to obtain output features of the encoder;
utilizing a connection time sequence classification module of the voice recognition model to carry out truncation processing on the output features of the encoder to obtain a plurality of feature segments;
and decoding each feature segment by utilizing a decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized.
2. The method of claim 1, wherein the encoder comprises a plurality of attention layers, and wherein the performing self-attention coding on the plurality of voice feature blocks by using the encoder of the voice recognition model to obtain the output features of the encoder comprises:
after each attention layer acquires input information, determining context information corresponding to each information block in the input information according to context configuration information corresponding to the attention layer, and performing self-attention calculation according to each information block and the context information corresponding to each information block to obtain an output result of the attention layer; wherein, the input information of a first attention layer in the plurality of attention layers is the plurality of voice feature blocks, and the input information of other attention layers except the first attention layer is the output result of the previous attention layer;
and taking the output result of the last attention layer in the plurality of attention layers as the output features of the encoder.
3. The method according to claim 2, wherein the context information corresponding to each information block comprises a previous information block and/or a next information block adjacent to the information block.
4. The method of claim 1, wherein the carrying out truncation processing on the output features of the encoder by utilizing the connection time sequence classification module of the voice recognition model to obtain a plurality of feature segments comprises:
acquiring a probability vector generated by the connection time sequence classification module according to the output features of the encoder;
determining a truncation point corresponding to each character according to the probability vector;
and truncating the output features of the encoder into a plurality of feature segments according to the truncation points.
5. The method of claim 4, wherein the determining a truncation point corresponding to each character according to the probability vector comprises:
acquiring a character output probability corresponding to the current moment and a character output probability corresponding to the previous moment of the current moment according to the probability vector;
and if the character output probability corresponding to the current moment represents that a character appears at the current moment, and the character output probability corresponding to the previous moment represents that no character appears at the previous moment or that the character appearing at the previous moment is different from the character appearing at the current moment, determining the current moment as a truncation point.
6. The method of claim 4, wherein the truncating the output features of the encoder into a plurality of feature segments according to the truncation points comprises:
determining a truncation window with a preset time length according to each truncation point;
and taking the output features of the encoder corresponding to each truncation window as a feature segment, thereby truncating the output features of the encoder into a plurality of feature segments.
7. The method according to claim 1, wherein the decoding each feature segment by utilizing the decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized comprises:
calculating the initial character output probability of the decoder at the current moment according to the feature segment corresponding to the current moment and all character output results of the decoder before the current moment;
calculating a character output result of the decoder at the current moment according to the initial character output probability of the decoder at the current moment and the character output probability of the connection time sequence classification module at the current moment;
and obtaining a recognition result corresponding to the voice signal to be recognized according to the character output result of the decoder at each moment.
8. A speech recognition apparatus, characterized in that the apparatus comprises:
a feature extraction module, configured to acquire a voice feature sequence corresponding to a voice signal to be recognized;
a feature processing module, configured to partition the voice feature sequence to obtain a plurality of voice feature blocks;
a recognition result determining module, configured to input the plurality of voice feature blocks into a pre-trained voice recognition model, and perform self-attention coding on the plurality of voice feature blocks by using an encoder of the voice recognition model to obtain output features of the encoder; carry out truncation processing on the output features of the encoder by utilizing a connection time sequence classification module of the voice recognition model to obtain a plurality of feature segments; and decode each feature segment by utilizing a decoder of the voice recognition model to obtain a recognition result corresponding to the voice signal to be recognized.
9. An electronic device, comprising a processor and a memory, the memory storing a computer program that, when executed by the processor, performs the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202110637229.6A 2021-06-08 2021-06-08 Speech recognition method, apparatus, electronic device, and computer-readable storage medium Active CN113327603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110637229.6A CN113327603B (en) 2021-06-08 2021-06-08 Speech recognition method, apparatus, electronic device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113327603A true CN113327603A (en) 2021-08-31
CN113327603B CN113327603B (en) 2024-05-17

Family

ID=77420127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110637229.6A Active CN113327603B (en) 2021-06-08 2021-06-08 Speech recognition method, apparatus, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113327603B (en)


Also Published As

Publication number Publication date
CN113327603B (en) 2024-05-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant