CN117558263B - Speech recognition method, device, equipment and readable storage medium - Google Patents

Speech recognition method, device, equipment and readable storage medium

Info

Publication number
CN117558263B
CN117558263B
Authority
CN
China
Prior art keywords
prompt
attention mechanism
vector
recognition model
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410034818.9A
Other languages
Chinese (zh)
Other versions
CN117558263A (en)
Inventor
马志强
李永超
孙磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202410034818.9A priority Critical patent/CN117558263B/en
Publication of CN117558263A publication Critical patent/CN117558263A/en
Application granted granted Critical
Publication of CN117558263B publication Critical patent/CN117558263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses a speech recognition method, apparatus, device, and readable storage medium. In this scheme, an end-to-end speech recognition model is subjected to domain-adaptive training in advance to obtain a multi-domain speech recognition model and prompt vector parameters for each domain, where the prompt vector parameters of a domain indicate speech recognition information specific to that domain. After speech data to be recognized is acquired and its acoustic feature sequence is determined, the prompt vector parameters of the domain to which the speech data belongs are obtained, and the prompt vector parameters and the acoustic feature sequence are input into the multi-domain speech recognition model, which performs encoding and decoding processing on them to obtain the recognition result of the speech data. This scheme effectively guarantees the recognition performance of the end-to-end speech recognition model in each domain.

Description

Speech recognition method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technology, and more particularly, to a speech recognition method, apparatus, device, and readable storage medium.
Background
Current speech recognition models fall into two categories: traditional speech recognition models and end-to-end speech recognition models. A traditional model builds the acoustic model and the language model separately, whereas an end-to-end model has the advantage of jointly modeling acoustics and language, which makes it the most widely applied type of speech recognition model at the present stage.
Speech recognition is applied in many domains, such as education, medical care, and in-vehicle scenarios, so how to guarantee the recognition performance of an end-to-end speech recognition model in each domain is a technical problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present application proposes a speech recognition method, apparatus, device, and readable storage medium. The specific scheme is as follows:
A speech recognition method, the method comprising:
Acquiring speech data to be recognized;
Determining an acoustic feature sequence of the speech data to be recognized;
Acquiring prompt vector parameters of the domain to which the speech data to be recognized belongs, wherein the prompt vector parameters are used to indicate speech recognition information specific to that domain;
Inputting the prompt vector parameters and the acoustic feature sequence into a multi-domain speech recognition model, wherein the multi-domain speech recognition model performs encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the speech data, and the multi-domain speech recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end speech recognition model.
Optionally, the multi-domain speech recognition model includes: an encoder and a decoder; the encoder includes N encoding blocks, the decoder includes N decoding blocks, and both the encoding blocks and the decoding blocks contain attention mechanism modules;
the multi-domain speech recognition model performing encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the speech data includes:
the encoder performs encoding processing based on the prompt vector parameters and the acoustic feature sequence, and the decoder performs decoding processing based on the prompt vector parameters and the output of the encoder to obtain the recognition result of the speech data.
Optionally, the prompt vector parameters include a key prompt vector parameter and a value prompt vector parameter, and the processing by each attention mechanism module includes:
determining a query vector parameter, a key vector parameter, and a value vector parameter;
splicing the key prompt vector parameter with the key vector parameter to obtain a key splice vector parameter;
splicing the value prompt vector parameter with the value vector parameter to obtain a value splice vector parameter;
and computing the output of the attention mechanism module based on the query vector parameter, the key splice vector parameter, and the value splice vector parameter.
Optionally, each encoding block includes a first attention mechanism module, and for each encoding block the query vector parameter, the key vector parameter, and the value vector parameter are determined by:
computing the query vector parameter, the key vector parameter, and the value vector parameter based on the original input of the encoding block;
wherein the original input of the first encoding block is the acoustic feature sequence of the speech data to be recognized, and the original input of every encoding block other than the first is the output of the preceding encoding block.
Optionally, each decoding block includes a second attention mechanism module and a third attention mechanism module, and for each attention mechanism module of a decoding block the query vector parameter, the key vector parameter, and the value vector parameter are determined by:
computing the query vector parameter, the key vector parameter, and the value vector parameter based on the original input of the attention mechanism module;
wherein the original input of the second attention mechanism module of the first decoding block is the decoded text sequence, and the original input of the third attention mechanism module is the output of the second attention mechanism module in that decoding block and the output of the encoder; for every decoding block other than the first, the original input of the second attention mechanism module is the output of the preceding decoding block, and the original input of the third attention mechanism module is the output of the second attention mechanism module in that decoding block.
Optionally, computing the output of the attention mechanism module based on the query vector parameter, the key splice vector parameter, and the value splice vector parameter includes:
matrix-multiplying the query vector parameter with the key splice vector parameter to obtain the weights of the attention mechanism;
and matrix-multiplying the weights of the attention mechanism with the value splice vector parameter to obtain the output of the attention mechanism module.
Optionally, the attention mechanism module is a single-head attention mechanism module, or each attention layer in a multi-head attention mechanism module.
Optionally, the domain-adaptive training of the end-to-end speech recognition model includes:
acquiring speech recognition training data of each domain and initial prompt vector parameters of each domain, wherein the speech recognition training data of a domain includes acoustic feature sequences of training speech of that domain and the text label sequences corresponding to the training speech;
inputting the acoustic feature sequences of the training speech of each domain into the end-to-end speech recognition model, and inputting the prompt vector parameters of each domain into each attention mechanism module in the end-to-end speech recognition model, to obtain the output result of the end-to-end speech recognition model;
determining the prediction loss of the end-to-end speech recognition model according to the output result of the end-to-end speech recognition model and the text label sequences corresponding to the training speech;
and updating the prompt vector parameters of each domain according to the prediction loss of the end-to-end speech recognition model, and obtaining the multi-domain speech recognition model and the prompt vector parameters of each domain after training ends.
A speech recognition apparatus, the apparatus comprising:
a speech data acquisition unit, configured to acquire speech data to be recognized;
an acoustic feature sequence determining unit, configured to determine an acoustic feature sequence of the speech data to be recognized;
a prompt vector parameter acquisition unit, configured to acquire prompt vector parameters of the domain to which the speech data to be recognized belongs, wherein the prompt vector parameters are used to indicate speech recognition information specific to that domain;
a recognition unit, configured to input the prompt vector parameters and the acoustic feature sequence into a multi-domain speech recognition model, wherein the multi-domain speech recognition model performs encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the speech data, and the multi-domain speech recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end speech recognition model.
Optionally, the multi-domain speech recognition model includes: an encoder and a decoder; the encoder includes N encoding blocks, the decoder includes N decoding blocks, and both the encoding blocks and the decoding blocks contain attention mechanism modules;
the recognition unit is specifically configured such that:
the encoder performs encoding processing based on the prompt vector parameters and the acoustic feature sequence, and the decoder performs decoding processing based on the prompt vector parameters and the output of the encoder to obtain the recognition result of the speech data.
Optionally, the prompt vector parameters include a key prompt vector parameter and a value prompt vector parameter, and the attention mechanism module includes:
a determining unit, configured to determine a query vector parameter, a key vector parameter, and a value vector parameter;
a first splicing unit, configured to splice the key prompt vector parameter with the key vector parameter to obtain a key splice vector parameter;
a second splicing unit, configured to splice the value prompt vector parameter with the value vector parameter to obtain a value splice vector parameter;
and a computing unit, configured to compute the output of the attention mechanism module based on the query vector parameter, the key splice vector parameter, and the value splice vector parameter.
Optionally, each encoding block includes a first attention mechanism module, and the determining unit is specifically configured to:
compute the query vector parameter, the key vector parameter, and the value vector parameter based on the original input of the encoding block;
wherein the original input of the first encoding block is the acoustic feature sequence of the speech data to be recognized, and the original input of every encoding block other than the first is the output of the preceding encoding block.
Optionally, each decoding block includes a second attention mechanism module and a third attention mechanism module, and for each attention mechanism module of a decoding block the determining unit is specifically configured to:
compute the query vector parameter, the key vector parameter, and the value vector parameter based on the original input of the attention mechanism module;
wherein the original input of the second attention mechanism module of the first decoding block is the decoded text sequence, and the original input of the third attention mechanism module is the output of the second attention mechanism module in that decoding block and the output of the encoder; for every decoding block other than the first, the original input of the second attention mechanism module is the output of the preceding decoding block, and the original input of the third attention mechanism module is the output of the second attention mechanism module in that decoding block.
Optionally, the computing unit is specifically configured to:
matrix-multiply the query vector parameter with the key splice vector parameter to obtain the weights of the attention mechanism;
and matrix-multiply the weights of the attention mechanism with the value splice vector parameter to obtain the output of the attention mechanism module.
Optionally, the attention mechanism module is a single-head attention mechanism module, or each attention layer in a multi-head attention mechanism module.
Optionally, the domain-adaptive training of the end-to-end speech recognition model includes:
acquiring speech recognition training data of each domain and initial prompt vector parameters of each domain, wherein the speech recognition training data of a domain includes acoustic feature sequences of training speech of that domain and the text label sequences corresponding to the training speech;
inputting the acoustic feature sequences of the training speech of each domain into the end-to-end speech recognition model, and inputting the prompt vector parameters of each domain into each attention mechanism module in the end-to-end speech recognition model, to obtain the output result of the end-to-end speech recognition model;
determining the prediction loss of the end-to-end speech recognition model according to the output result of the end-to-end speech recognition model and the text label sequences corresponding to the training speech;
and updating the prompt vector parameters of each domain according to the prediction loss of the end-to-end speech recognition model, and obtaining the multi-domain speech recognition model and the prompt vector parameters of each domain after training ends.
A speech recognition device, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the speech recognition method described above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech recognition method described above.
Through the above technical scheme, the present application discloses a speech recognition method, apparatus, device, and readable storage medium. In this scheme, an end-to-end speech recognition model is subjected to domain-adaptive training in advance to obtain a multi-domain speech recognition model and prompt vector parameters for each domain, where the prompt vector parameters of a domain indicate speech recognition information specific to that domain. After speech data to be recognized is acquired and its acoustic feature sequence is determined, the prompt vector parameters of the domain to which the speech data belongs are obtained, and the prompt vector parameters and the acoustic feature sequence are input into the multi-domain speech recognition model, which encodes and decodes them to obtain the recognition result of the speech data. This scheme effectively guarantees the recognition performance of the end-to-end speech recognition model in each domain.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the network structure of an attention mechanism module disclosed in the present application;
FIG. 3 is a flow chart of an implementation of domain-adaptive training of an end-to-end speech recognition model disclosed in the present application;
FIG. 4 is a flow chart illustrating the processing of an attention mechanism module according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating the computation of an attention mechanism module according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 7 is a block diagram of the hardware structure of a speech recognition device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to ensure the recognition effect of the end-to-end speech recognition model in various fields, a person skilled in the art mostly adopts a mode of performing field self-adaptive training on the end-to-end speech recognition model.
At present, there are two types of domain-adaptive training schemes for end-to-end speech recognition models. The first is full-parameter fine-tuning: a general-purpose end-to-end speech recognition model is fine-tuned on domain training data with all parameters updated. This scheme has a long training period and poor generalization; every domain requires training a brand-new model, which raises model training and deployment costs, and the recognition performance in the general domain may also be affected.
The second is fine-tuning only the output-layer parameters: a general-purpose end-to-end speech recognition model is fine-tuned on domain training data with only the output-layer parameters updated and the parameters of all other layers fixed, so that only a small fraction of the model can adapt to the domain.
Therefore, current domain-adaptive training schemes cannot effectively guarantee the recognition performance of end-to-end speech recognition models in each domain.
Through research, the inventors of the present application found that a domain-adaptive training scheme for end-to-end speech recognition models needs to take the following problems into account:
First: the volume of parameters trained and fine-tuned.
The number of parameters updated during domain-adaptive training cannot be too large, and few extra parameters should be added; the training time cannot be too long, so as to guarantee efficient migration to the target domain; and deployment cost must be considered, so a brand-new speech recognition model cannot be deployed for every domain.
Second: training of the model's original parameters.
When training on the target domain, the parameters and weights of the original model should be changed as little as possible. Keeping the network parameters of the original model fixed achieves a high degree of parameter sharing, alleviates the catastrophic-forgetting problem common in domain-adaptive training, avoids degrading performance in the general scenario, and reduces over-fitting to the target scenario.
Third: domain scalability.
When multiple target domains need to be adapted, an independent but small set of model parameters can be trained for each domain, enabling rapid extension to different domains.
Based on this line of thought, the inventors of the present scheme conducted in-depth research and finally proposed a domain-adaptive training method for end-to-end speech recognition models based on Prompt Learning.
Prompt Learning is a learning method widely used in the field of NLP (Natural Language Processing). Without significantly changing the structure or parameters of a pre-trained language model, it turns a downstream task into a text generation task by adding "prompt information" to the input. Unlike traditional supervised learning, which predicts an output y for an input x by modeling the probability p(y|x), prompt learning builds on a language model that directly computes the probability of text. To let such a model carry out a prediction task, the input x is converted into x' through a template: x' keeps some of the tokens of x and leaves one or more slots to be filled; the language model is then used to fill the slots of x', and the output y is finally derived from the filled result.
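As a toy illustration of this template mechanism (the task, template, and candidate fills below are invented for illustration and are not taken from the present application):

```python
# Sentiment classification recast as slot filling, following the
# x -> x' -> y pipeline described above. Template and label map are
# illustrative assumptions.
x = "I missed the bus and got soaked in the rain."
template = "{x} Overall, it was a [Z] day."
x_prime = template.format(x=x)      # x': keeps x, leaves slot [Z] to fill
candidates = {"terrible": "negative", "great": "positive"}
# A pre-trained language model would score each candidate fill of [Z];
# the output y is read off from the best-scoring fill, conceptually:
# y = candidates[argmax over z of LM_score(x_prime with [Z] := z)]
print(x_prime)
```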
The Prompt-Learning-based domain-adaptive training method for end-to-end speech recognition models proposed by the inventors of the present application is specifically based on modifying, with Prompt Learning, the attention mechanism modules in the end-to-end speech recognition model; through this lightweight fine-tuning, the recognition performance of the end-to-end speech recognition model in each domain can be effectively guaranteed.
Next, the domain-adaptive training method for end-to-end speech recognition models and the speech recognition method provided by the present application are described through the following embodiments.
Referring to FIG. 1, FIG. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application; the method may include the following steps:
S101: acquire speech data to be recognized.
In the present application, the speech data to be recognized may be speech data in any language, of any duration, and from any domain; the application imposes no limitation on this.
S102: determine an acoustic feature sequence of the speech data to be recognized.
In the present application, the speech data to be recognized comprises a plurality of speech frames. As a specific implementation, determining the acoustic feature sequence may consist of determining the acoustic features corresponding to each speech frame in the speech data to be recognized, thereby obtaining the acoustic feature sequence of the speech data to be recognized.
The acoustic features may be common acoustic features such as PLP (Perceptual Linear Prediction) coefficients, MFCC (Mel-scale Frequency Cepstral Coefficients), or Filter Bank features; as one implementation, the acoustic features in the present application may be Filter Bank features.
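For instance, frame-level Filter Bank features can be extracted as in the following sketch (assuming the torchaudio library; the audio file name is hypothetical):

```python
import torchaudio

# Hypothetical input file; any mono recording would do.
waveform, sample_rate = torchaudio.load("utterance.wav")
# Kaldi-style log-mel Filter Bank features: one 40-dimensional vector per
# 10 ms frame, matching the 40-dimensional features used in the embodiments.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=40, sample_frequency=sample_rate)
print(fbank.shape)  # (M, 40): an acoustic feature sequence of M frames
```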
S103: acquire prompt vector parameters of the domain to which the speech data to be recognized belongs.
In the present application, the prompt vector parameters indicate speech recognition information specific to a domain; different domains have different prompt vector parameters. Each domain may have one or more prompt vector parameters; as one implementation, the prompt vector parameters may include a Key prompt vector parameter and a Value prompt vector parameter. As one implementation, each prompt vector parameter may consist of L prompt vectors, where L is an integer greater than or equal to 1.
S104: input the prompt vector parameters and the acoustic feature sequence into a multi-domain speech recognition model; the multi-domain speech recognition model performs encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the speech data.
In the present application, the multi-domain speech recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end speech recognition model; the training procedure is described in detail in a later embodiment. When the multi-domain speech recognition model encodes and decodes the prompt vector parameters and the acoustic feature sequence to obtain the recognition result of the speech data, it is mainly the attention mechanism modules that apply the prompt vector parameters in their attention computation. As one implementation, only some of the attention mechanism modules may apply the prompt vector parameters, or all of them may do so; the application places no limitation on this, but in view of model performance, all attention mechanism modules may apply the prompt vector parameters. This is described in detail in the following embodiments.
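Steps S101 to S104 can be strung together as in the following sketch (the model interface and the prompt lookup are assumptions made for illustration, not names fixed by the present application):

```python
import torch
import torchaudio

def recognize(wav_path: str, domain: str, model: torch.nn.Module,
              domain_prompts: dict) -> str:
    """`model` stands for a trained multi-domain recognition model and
    `domain_prompts` maps a domain name to its (key, value) prompt
    vector parameters; both interfaces are illustrative assumptions."""
    waveform, sr = torchaudio.load(wav_path)                   # S101
    feats = torchaudio.compliance.kaldi.fbank(                 # S102
        waveform, num_mel_bins=40, sample_frequency=sr)
    key_prompt, value_prompt = domain_prompts[domain]          # S103
    with torch.no_grad():                                      # S104
        return model(feats.unsqueeze(0), key_prompt, value_prompt)
```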
This embodiment discloses a speech recognition method. In this scheme, an end-to-end speech recognition model is first subjected to domain-adaptive training to obtain a multi-domain speech recognition model and prompt vector parameters for each domain, where the prompt vector parameters of a domain indicate speech recognition information specific to that domain. After the speech data to be recognized is acquired and its acoustic feature sequence is determined, the prompt vector parameters of the domain to which the speech data belongs are obtained, and the prompt vector parameters and the acoustic feature sequence are input into the multi-domain speech recognition model, which encodes and decodes them to obtain the recognition result of the speech data. This scheme effectively guarantees the recognition performance of the end-to-end speech recognition model in each domain.
The above embodiment points out that the multi-domain speech recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end speech recognition model. The structure of the end-to-end speech recognition model used in the present application is briefly described first.
The end-to-end speech recognition model adopts an encoder-decoder framework, which may be a Transformer, a Conformer, or the like; no limitation is imposed. As one implementation, in the present application the encoder of the end-to-end speech recognition model includes N encoding blocks and the decoder includes N decoding blocks, where N is an integer greater than 1; for example, N may be 12 or 16, without limitation. Both the encoding blocks and the decoding blocks contain attention mechanism modules. Common attention mechanisms include the single-head attention mechanism and the multi-head attention mechanism: if the encoder and decoder adopt a single-head attention mechanism, the attention mechanism module is the single-head attention module; if they adopt a multi-head attention mechanism, each attention layer (head) of the multi-head mechanism constitutes an attention mechanism module. In a conventional end-to-end speech recognition model, the inputs of each attention mechanism module are a Query vector parameter, a Key vector parameter, and a Value vector parameter.
For ease of understanding, referring to FIG. 2, FIG. 2 is a schematic diagram of the network structure of an attention mechanism module disclosed in the present application. As shown in FIG. 2, the inputs are a Query vector parameter Q, a Key vector parameter K, and a Value vector parameter V. If the attention mechanism module is a single-head attention module, it performs the attention computation based on the following formula:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QW^{Q}\,(KW^{K})^{\top}}{\sqrt{d_{k}}}\right)VW^{V}$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are training parameters and $d_{k}$ is the dimensionality of the projected keys.
If the attention mechanism module is attention layer i in a multi-head attention mechanism, it performs the attention computation based on the following formula:

$$\mathrm{head}_{i}=\mathrm{softmax}\!\left(\frac{QW_{i}^{Q}\,(KW_{i}^{K})^{\top}}{\sqrt{d_{k}}}\right)VW_{i}^{V}$$

where i denotes the i-th attention layer (head) within the multi-head attention mechanism, and $W_{i}^{Q}$, $W_{i}^{K}$, $W_{i}^{V}$ are the training parameters corresponding to head i.
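In code, one such attention layer can be sketched as follows (a PyTorch sketch of the formula above; tensor shapes are illustrative):

```python
import math
import torch

def attention_head(Q, K, V, Wq, Wk, Wv):
    """head_i for Q of shape (T_q, d_model) and K, V of shape (T, d_model);
    Wq, Wk, Wv are the training parameters of head i."""
    q, k, v = Q @ Wq, K @ Wk, V @ Wv
    weights = torch.softmax(
        q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
    return weights @ v  # shape (T_q, d_v)
```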
After the end-to-end speech recognition model is determined, the multi-domain speech recognition model can be obtained by performing domain-adaptive training on it. Another embodiment of the present application describes an implementation of this domain-adaptive training in detail; as shown in FIG. 3, it may include the following steps:
S201: acquire speech recognition training data of each domain and initial prompt vector parameters of each domain, wherein the speech recognition training data of a domain includes acoustic feature sequences of training speech of that domain and the text label sequences corresponding to the training speech.
The speech recognition training data may come from any of multiple domains, such as education, medical care, and in-vehicle scenarios; the application imposes no limitation on this.
The acoustic features may be common acoustic features such as PLP (Perceptual Linear Prediction) coefficients, MFCC (Mel-scale Frequency Cepstral Coefficients), or Filter Bank features; as one implementation, the acoustic features in the present application may be Filter Bank features.
Illustratively, the acoustic feature sequence and the text label sequence of one sentence of training speech data may be represented as follows:

Acoustic feature sequence: $X=[x_{1},x_{2},\ldots,x_{m},\ldots,x_{M}]$

Text label sequence: $Y=[y_{0},y_{1},\ldots,y_{t},\ldots,y_{T}]$

where $x_{m}$ denotes the acoustic feature vector of the m-th frame in the acoustic feature sequence X (as one implementation, the present application may use 40-dimensional Filter Bank features) and M is the total number of speech frames; $y_{t}$ denotes the t-th character in the text label sequence Y, and T+1 is the total number of characters in the label, where $y_{0}$ is the sentence-start symbol "<s>" and $y_{T}$ is the sentence-end symbol "</s>". Taking Chinese speech recognition as an example, with a single Chinese character as the modeling unit, the training data can cover roughly 6700 commonly used Chinese characters. Suppose the text content of a sentence is "欢迎来到科大讯飞" ("Welcome to iFlytek"), which is 8 Chinese characters; adding the sentence-start and sentence-end symbols gives a text label sequence of 10 tokens in total: Y = [<s>, 欢, 迎, 来, 到, 科, 大, 讯, 飞, </s>].
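The label side of this example can be built as follows (a small illustration; in practice each token would also be mapped to an index in the roughly 6700-character vocabulary):

```python
# Build the text label sequence Y for the example sentence.
text = "欢迎来到科大讯飞"             # "Welcome to iFlytek", 8 characters
Y = ["<s>"] + list(text) + ["</s>"]   # 10 tokens: y_0 ... y_T with T = 9
assert len(Y) == 10
```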
S202: input the acoustic feature sequences of the training speech of each domain into the end-to-end speech recognition model, and input the initial prompt vector parameters of each domain into each attention mechanism module in the end-to-end speech recognition model, to obtain the output result of the end-to-end speech recognition model.
In the present application, each acoustic feature sequence of training speech corresponds to a set of initial prompt vector parameters. After the acoustic feature sequence of a training utterance is input into the end-to-end speech recognition model, each attention mechanism module in the model performs its attention computation using the initial prompt vector parameters corresponding to that acoustic feature sequence. The computation is identical to that in the actual speech recognition scenario; see the description of the processing of the attention mechanism module in the following embodiment.
During training, as one implementation, a batch may contain training speech data of a single domain, realizing independent training for each domain. As another implementation, a batch may contain training speech data of multiple domains, realizing mixed training across domains.
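Both batching strategies can be sketched as follows (the sample layout and function name are assumptions for illustration):

```python
import random
from collections import defaultdict

def make_batches(samples, batch_size, per_domain=True):
    """`samples` is assumed to be a list of (domain, features, labels)
    tuples. per_domain=True yields single-domain batches (independent
    training); per_domain=False yields mixed multi-domain batches."""
    if per_domain:
        by_domain = defaultdict(list)
        for s in samples:
            by_domain[s[0]].append(s)
        pools = list(by_domain.values())
    else:
        pools = [list(samples)]
    for pool in pools:
        random.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield pool[i:i + batch_size]
```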
S203: determine the prediction loss of the end-to-end speech recognition model according to the output result of the end-to-end speech recognition model and the text label sequences corresponding to the training speech.
In the present application, the prediction loss of the end-to-end speech recognition model may be any loss, such as a cross-entropy loss or a mean-square-error loss, chosen according to the requirements of the scenario; the application imposes no limitation on this.
S204: update the initial prompt vector parameters of each domain according to the prediction loss of the end-to-end speech recognition model, and obtain the multi-domain speech recognition model and the prompt vector parameters of each domain after training ends.
In the present application, during domain-adaptive training of the end-to-end speech recognition model, the original parameters of the end-to-end speech recognition model are kept unchanged and only the prompt vector parameters of each domain are updated; therefore, the multi-domain speech recognition model and the prompt vector parameters of each domain are obtained when training ends.
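A minimal training-loop sketch of steps S201 to S204 for one domain follows (the model interface, sizes, and loss choice are illustrative assumptions; the key point is that only the prompt vectors receive gradients):

```python
import torch

def adapt_domain(model, loader, num_modules=24, L=8, d_k=64, lr=1e-4):
    for p in model.parameters():
        p.requires_grad_(False)         # original model parameters stay fixed
    # one (key, value) pair of L prompt vectors per attention mechanism module
    prompts = torch.nn.ParameterList(
        torch.nn.Parameter(0.02 * torch.randn(2, L, d_k))
        for _ in range(num_modules))
    optimizer = torch.optim.Adam(prompts, lr=lr)
    for feats, labels in loader:        # acoustic features, label sequences
        logits = model(feats, prompts)  # prompts fed to each attention module
        loss = torch.nn.functional.cross_entropy(     # prediction loss (S203)
            logits.transpose(1, 2), labels)
        optimizer.zero_grad()
        loss.backward()                 # gradients reach only the prompts
        optimizer.step()                # update the prompt vectors (S204)
    return prompts                      # the domain's prompt vector parameters
```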
In this embodiment, the domain-adaptive training method for the end-to-end speech recognition model does not update the parameters of the original model; only the prompt vector parameters of the domain need to be adjusted, so the number of trainable parameters is small.
In addition, in this embodiment, the domain-adaptive training method does not require deploying a brand-new speech recognition model for each domain: by training an independent but small set of model parameters (i.e., the prompt vector parameters) for each domain, the recognition capability of the end-to-end speech recognition model in that domain can be activated. This improves the domain scalability of the end-to-end speech recognition model and guarantees its speech recognition performance in each domain.
The structure of the multi-domain speech recognition model is the same as that of the end-to-end speech recognition model; that is, the multi-domain speech recognition model includes an encoder and a decoder, the encoder includes N encoding blocks, the decoder includes N decoding blocks, and both the encoding blocks and the decoding blocks contain attention mechanism modules. The multi-domain speech recognition model encodes and decodes the prompt vector parameters and the acoustic feature sequence to obtain the recognition result of the speech data as follows: the encoder performs encoding processing based on the prompt vector parameters and the acoustic feature sequence, and the decoder performs decoding processing based on the prompt vector parameters and the output of the encoder to obtain the recognition result of the speech data.
It should be noted that the only difference between the multi-domain speech recognition model and the end-to-end speech recognition model lies in the input of the attention mechanism modules: relative to the end-to-end model, in the multi-domain model the input of each attention mechanism module includes not only the Query vector parameter, the Key vector parameter, and the Value vector parameter but also the prompt vector parameters. Accordingly, the present application only describes in detail the processing of each attention mechanism module when the encoder encodes based on the prompt vector parameters and the acoustic feature sequence and the decoder decodes based on the prompt vector parameters and the encoder output; for the remaining processing, reference may be made to the end-to-end speech recognition model, and details are not repeated here.
When the encoder performs encoding processing based on the prompt vector parameters and the acoustic feature sequence, and when the decoder performs decoding processing based on the prompt vector parameters and the output of the encoder, the processing of each attention mechanism module may include the following steps, shown in FIG. 4:
S301: determine a Query vector parameter, a Key vector parameter, and a Value vector parameter.
In the present application, different encoding and decoding blocks determine the Query, Key, and Value vector parameters in different ways. In one embodiment, each encoding block contains one attention mechanism module and each decoding block contains two attention mechanism modules, following the encoder and decoder structures of the Transformer model. Assuming each encoding block includes a first attention mechanism module, as one implementation, the Query, Key, and Value vector parameters of each encoding block are determined as follows: the Query, Key, and Value vector parameters are computed from the original input of the encoding block, where the original input of the first encoding block is the acoustic feature sequence of the speech data to be recognized and the original input of every other encoding block is the output of the preceding encoding block.
Assuming each decoding block includes a second attention mechanism module and a third attention mechanism module, as another implementation, for each attention mechanism module of a decoding block the Query, Key, and Value vector parameters are determined as follows: the Query, Key, and Value vector parameters are computed from the original input of the attention mechanism module, where the original input of the second attention mechanism module of the first decoding block is the decoded text sequence, and the original input of the third attention mechanism module is the output of the second attention mechanism module in that decoding block and the output of the encoder; for every decoding block other than the first, the original input of the second attention mechanism module is the output of the preceding decoding block, and the original input of the third attention mechanism module is the output of the second attention mechanism module in that decoding block.
S302: splice the Key prompt vector parameter with the Key vector parameter to obtain a Key splice vector parameter.
In the present application, as one implementation, the Key prompt vector parameter has the same dimensionality as the Key vector parameter, and the Key prompt vector parameter may be spliced to the front of the Key vector parameter to obtain the Key splice vector parameter.
S303: splice the Value prompt vector parameter with the Value vector parameter to obtain a Value splice vector parameter.
In the present application, as one implementation, the Value prompt vector parameter has the same dimensionality as the Value vector parameter, and the Value prompt vector parameter may be spliced to the front of the Value vector parameter to obtain the Value splice vector parameter.
S304: compute the output of the attention mechanism module based on the Query vector parameter, the Key splice vector parameter, and the Value splice vector parameter.
As one implementation, computing the output of the attention mechanism module with an attention mechanism based on the Query vector parameter, the Key splice vector parameter, and the Value splice vector parameter includes: matrix-multiplying the Query vector parameter with the Key splice vector parameter to obtain the weights of the attention mechanism, and matrix-multiplying the weights of the attention mechanism with the Value splice vector parameter to obtain the output of the attention mechanism module.
To facilitate understanding of the processing of the attention mechanism module, assume the attention mechanism module is attention layer i in a multi-head attention mechanism module. Its output may be computed with the following formula:

$$\mathrm{head}_{i}=\mathrm{softmax}\!\left(\frac{QW_{i}^{Q}\,\big[P_{i}^{K};\,KW_{i}^{K}\big]^{\top}}{\sqrt{d_{k}}}\right)\big[P_{i}^{V};\,VW_{i}^{V}\big]$$

where $P_{i}^{K}$ and $P_{i}^{V}$ are the Key and Value prompt vector parameters corresponding to attention layer i, each of dimension $L\times d_{k}$; $[P_{i}^{K}; KW_{i}^{K}]$ is the Key splice vector parameter and $[P_{i}^{V}; VW_{i}^{V}]$ is the Value splice vector parameter, each of dimension $(L+T)\times d_{k}$, with T the length of the original Key/Value input. Matrix-multiplying $QW_{i}^{Q}$ with the Key splice vector parameter (the Prompt MatMul step) yields the weights of the attention mechanism, of dimension $T_{q}\times(L+T)$, where $T_{q}$ is the length of the Query input; matrix-multiplying these weights with the Value splice vector parameter (a second Prompt MatMul step) yields the output of the attention mechanism module, of dimension $T_{q}\times d_{k}$. This is shown in FIG. 5.
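The computation reduces to prepending the prompt vectors to the projected Keys and Values, as in this sketch (shapes follow the dimensions above; all names are illustrative):

```python
import math
import torch

def prompt_attention_head(Q, K, V, Wq, Wk, Wv, Pk, Pv):
    """Attention layer i with prompts: Pk and Pv have shape (L, d_k);
    Q has shape (T_q, d_model), K and V have shape (T, d_model)."""
    q = Q @ Wq                                # (T_q, d_k)
    k = torch.cat([Pk, K @ Wk], dim=0)        # (L + T, d_k): Key splice
    v = torch.cat([Pv, V @ Wv], dim=0)        # (L + T, d_k): Value splice
    weights = torch.softmax(
        q @ k.T / math.sqrt(q.size(-1)), dim=-1)   # (T_q, L + T)
    return weights @ v                        # (T_q, d_k)
```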
The speech recognition apparatus disclosed in the embodiments of the present application is described below; the speech recognition apparatus described below and the speech recognition method described above may be cross-referenced with each other.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application. As shown in FIG. 6, the speech recognition apparatus may include:
a speech data acquisition unit 11, configured to acquire speech data to be recognized;
an acoustic feature sequence determining unit 12, configured to determine an acoustic feature sequence of the speech data to be recognized;
a prompt vector parameter acquisition unit 13, configured to acquire prompt vector parameters of the domain to which the speech data to be recognized belongs, wherein the prompt vector parameters are used to indicate speech recognition information specific to that domain;
a recognition unit 14, configured to input the prompt vector parameters and the acoustic feature sequence into a multi-domain speech recognition model, wherein the multi-domain speech recognition model performs encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the speech data, and the multi-domain speech recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end speech recognition model.
As one implementation, the multi-domain speech recognition model includes: an encoder and a decoder; the encoder includes N encoding blocks, the decoder includes N decoding blocks, and both the encoding blocks and the decoding blocks contain attention mechanism modules;
the recognition unit is specifically configured such that:
the encoder performs encoding processing based on the prompt vector parameters and the acoustic feature sequence, and the decoder performs decoding processing based on the prompt vector parameters and the output of the encoder to obtain the recognition result of the speech data.
As one implementation, the prompt vector parameters include a key prompt vector parameter and a value prompt vector parameter, and the attention mechanism module includes:
a determining unit, configured to determine a query vector parameter, a key vector parameter, and a value vector parameter;
a first splicing unit, configured to splice the key prompt vector parameter with the key vector parameter to obtain a key splice vector parameter;
a second splicing unit, configured to splice the value prompt vector parameter with the value vector parameter to obtain a value splice vector parameter;
and a computing unit, configured to compute the output of the attention mechanism module based on the query vector parameter, the key splice vector parameter, and the value splice vector parameter.
As one implementation, each encoding block includes a first attention mechanism module, and the determining unit is specifically configured to:
compute the query vector parameter, the key vector parameter, and the value vector parameter based on the original input of the encoding block;
wherein the original input of the first encoding block is the acoustic feature sequence of the speech data to be recognized, and the original input of every encoding block other than the first is the output of the preceding encoding block.
As one implementation, each decoding block includes a second attention mechanism module and a third attention mechanism module, and for each attention mechanism module of a decoding block the determining unit is specifically configured to:
compute the query vector parameter, the key vector parameter, and the value vector parameter based on the original input of the attention mechanism module;
wherein the original input of the second attention mechanism module of the first decoding block is the decoded text sequence, and the original input of the third attention mechanism module is the output of the second attention mechanism module in that decoding block and the output of the encoder; for every decoding block other than the first, the original input of the second attention mechanism module is the output of the preceding decoding block, and the original input of the third attention mechanism module is the output of the second attention mechanism module in that decoding block.
As one implementation, the computing unit is specifically configured to:
matrix-multiply the query vector parameter with the key splice vector parameter to obtain the weights of the attention mechanism;
and matrix-multiply the weights of the attention mechanism with the value splice vector parameter to obtain the output of the attention mechanism module.
As one implementation, the attention mechanism module is a single-head attention mechanism module, or each attention layer in a multi-head attention mechanism module.
As one implementation, the domain-adaptive training of the end-to-end speech recognition model includes:
acquiring speech recognition training data of each domain and initial prompt vector parameters of each domain, wherein the speech recognition training data of a domain includes acoustic feature sequences of training speech of that domain and the text label sequences corresponding to the training speech;
inputting the acoustic feature sequences of the training speech of each domain into the end-to-end speech recognition model, and inputting the prompt vector parameters of each domain into each attention mechanism module in the end-to-end speech recognition model, to obtain the output result of the end-to-end speech recognition model;
determining the prediction loss of the end-to-end speech recognition model according to the output result of the end-to-end speech recognition model and the text label sequences corresponding to the training speech;
and updating the prompt vector parameters of each domain according to the prediction loss of the end-to-end speech recognition model, and obtaining the multi-domain speech recognition model and the prompt vector parameters of each domain after training ends.
Referring to FIG. 7, FIG. 7 is a block diagram of the hardware structure of a speech recognition device according to an embodiment of the present application. As shown in FIG. 7, the hardware structure of the speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
In the embodiment of the present application, there is at least one each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
Acquiring speech data to be recognized;
Determining an acoustic feature sequence of the speech data to be recognized;
Acquiring prompt vector parameters of the domain to which the speech data to be recognized belongs, wherein the prompt vector parameters are used to indicate speech recognition information specific to that domain;
Inputting the prompt vector parameters and the acoustic feature sequence into a multi-domain speech recognition model, wherein the multi-domain speech recognition model performs encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the speech data, and the multi-domain speech recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end speech recognition model.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
The embodiment of the present application also provides a readable storage medium storing a program adapted to be executed by a processor, the program being configured to:
Acquiring speech data to be recognized;
Determining an acoustic feature sequence of the speech data to be recognized;
Acquiring prompt vector parameters of the domain to which the speech data to be recognized belongs, wherein the prompt vector parameters are used to indicate speech recognition information specific to that domain;
Inputting the prompt vector parameters and the acoustic feature sequence into a multi-domain speech recognition model, wherein the multi-domain speech recognition model performs encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the speech data, and the multi-domain speech recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end speech recognition model.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method of speech recognition, the method comprising:
Acquiring voice data to be recognized;
Determining an acoustic feature sequence of the voice data to be recognized;
Acquiring prompt vector parameters of the domain to which the voice data to be recognized belongs, wherein the prompt vector parameters are used for indicating voice recognition information specific to that domain, and the prompt vector parameters comprise key prompt vector parameters and value prompt vector parameters;
Inputting the prompt vector parameters and the acoustic feature sequence into a multi-domain voice recognition model, wherein the multi-domain voice recognition model performs encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the voice data, and the multi-domain voice recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end voice recognition model;
wherein, in the process of the multi-domain voice recognition model encoding and decoding the prompt vector parameters and the acoustic feature sequence to obtain the recognition result of the voice data, an attention mechanism module applies the prompt vector parameters to perform attention calculation.
2. The method of claim 1, wherein the multi-domain voice recognition model comprises an encoder and a decoder; the encoder comprises N encoding blocks, the decoder comprises N decoding blocks, and each encoding block and each decoding block comprises an attention mechanism module;
wherein the multi-domain voice recognition model performing encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain the recognition result of the voice data comprises:
the encoder performing encoding based on the prompt vector parameters and the acoustic feature sequence, and the decoder performing decoding based on the prompt vector parameters and the output of the encoder, to obtain the recognition result of the voice data.
3. The method of claim 2, wherein the processing performed by each attention mechanism module comprises:
determining a query vector parameter, a key vector parameter, and a value vector parameter;
splicing the key prompt vector parameters with the key vector parameters to obtain key splice vector parameters;
splicing the value prompt vector parameters with the value vector parameters to obtain value splice vector parameters;
and calculating the output of the attention mechanism module based on the query vector parameter, the key splice vector parameter, and the value splice vector parameter.
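As a non-authoritative sketch of the splicing steps above (the tensor shapes and the choice to prepend the prompts along the sequence axis are assumptions for illustration):

```python
import torch

T, n_prompt, d_model = 100, 16, 256          # assumed sizes
k = torch.randn(T, d_model)                  # key vector parameters
v = torch.randn(T, d_model)                  # value vector parameters
prompt_k = torch.randn(n_prompt, d_model)    # key prompt vector parameters
prompt_v = torch.randn(n_prompt, d_model)    # value prompt vector parameters

# Splice the prompts onto keys and values along the sequence axis.
k_splice = torch.cat([prompt_k, k], dim=0)   # (n_prompt + T, d_model)
v_splice = torch.cat([prompt_v, v], dim=0)   # (n_prompt + T, d_model)
```

Because only the key/value axis grows, the attention output keeps the query's sequence length, so the module's interface to the rest of the encoder or decoder is unchanged.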
4. The method according to claim 3, wherein each encoding block comprises a first attention mechanism module, and wherein, for the first attention mechanism module of each encoding block, the manner of determining the query vector parameter, the key vector parameter, and the value vector parameter comprises:
calculating the query vector parameter, the key vector parameter, and the value vector parameter based on the original input of the encoding block;
wherein the original input of the first encoding block is the acoustic feature sequence of the voice data to be recognized, and the original input of each encoding block other than the first is the output of the preceding encoding block.
5. The method according to claim 3, wherein each decoding block comprises a second attention mechanism module and a third attention mechanism module, and wherein, for each attention mechanism module of a decoding block, the manner of determining the query vector parameter, the key vector parameter, and the value vector parameter comprises:
calculating the query vector parameter, the key vector parameter, and the value vector parameter based on the original input of the attention mechanism module;
wherein the original input of the second attention mechanism module of the first decoding block is the decoded text sequence, and the original input of the third attention mechanism module of the first decoding block is the output of the second attention mechanism module in that decoding block together with the output of the encoder; the original input of the second attention mechanism module of each decoding block other than the first is the output of the preceding decoding block, and the original input of its third attention mechanism module is the output of the second attention mechanism module in that decoding block together with the output of the encoder.
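One illustrative reading of this routing is sketched below; `self_attn` and `cross_attn` stand in for the second and third attention mechanism modules, and the prompt splicing of claim 3 is omitted here for brevity.

```python
def decoding_block(i, text_seq, prev_out, enc_out, self_attn, cross_attn):
    """Sketch of the claim-5 input routing for decoding block i (0-based).

    text_seq: decoded text sequence embeddings (input to the first block).
    prev_out: output of the preceding decoding block (used when i > 0).
    enc_out:  encoder output, attended to by the third attention module.
    """
    x = text_seq if i == 0 else prev_out
    h = self_attn(q=x, k=x, v=x)                  # second attention mechanism module
    return cross_attn(q=h, k=enc_out, v=enc_out)  # third attention mechanism module
```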
6. The method of claim 3, wherein the calculating the output of the attention mechanism module based on the query vector parameter, the key splice vector parameter, and the value splice vector parameter comprises:
performing matrix multiplication on the query vector parameter and the key splice vector parameter to obtain the weights of the attention mechanism;
and performing matrix multiplication on the weights of the attention mechanism and the value splice vector parameter to obtain the output of the attention mechanism module.
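Putting claims 3 and 6 together, a complete attention computation might look like the sketch below. The softmax normalization and the 1/sqrt(d) scaling are conventional additions assumed here; the claim itself recites only the two matrix multiplications.

```python
import torch

def prompted_attention(q, k, v, prompt_k, prompt_v):
    """Sketch of claims 3 and 6: attention over prompt-spliced keys/values."""
    k_splice = torch.cat([prompt_k, k], dim=0)     # claim 3: key splice vector
    v_splice = torch.cat([prompt_v, v], dim=0)     # claim 3: value splice vector
    scores = q @ k_splice.t() / q.size(-1) ** 0.5  # claim 6: Q x K^T (scaling assumed)
    weights = torch.softmax(scores, dim=-1)        # normalization assumed, not recited
    return weights @ v_splice                      # claim 6: weights x V
```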
7. The method of claim 2, wherein the attention mechanism module is a single-head attention mechanism module, or each attention mechanism layer in a multi-head attention mechanism module.
8. The method of claim 1, wherein the manner of performing domain-adaptive training on the end-to-end voice recognition model comprises:
acquiring voice recognition training data of each domain and initial prompt vector parameters of each domain, wherein the voice recognition training data of a domain comprises acoustic feature sequences of training voices of that domain and text labeling sequences corresponding to the training voices;
inputting the acoustic feature sequences of the training voices of each domain into an end-to-end voice recognition model, inputting the prompt vector parameters of each domain into each attention mechanism module in the end-to-end voice recognition model, and obtaining an output result of the end-to-end voice recognition model;
determining the prediction loss of the end-to-end voice recognition model according to the output result of the end-to-end voice recognition model and the text labeling sequences corresponding to the training voices;
and updating the prompt vector parameters of each domain according to the prediction loss of the end-to-end voice recognition model, so as to obtain, after training, the multi-domain voice recognition model and the prompt vector parameters of each domain.
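The training loop might be sketched as follows; `adapt_prompts`, the `model(feats, pk, pv)` signature, and the cross-entropy loss are illustrative assumptions. Only the prompt parameters are optimized here, which is one reading of the claim; jointly fine-tuning the backbone would equally yield a multi-domain model after training.

```python
import torch

def adapt_prompts(model, prompts, loaders, epochs=3, lr=1e-3):
    """Sketch of the domain-adaptive training above (hypothetical helper).

    prompts: dict mapping domain -> (prompt_k, prompt_v) nn.Parameter pair.
    loaders: dict mapping domain -> iterable of (feats, labels) batches.
    """
    params = [p for pair in prompts.values() for p in pair]
    opt = torch.optim.Adam(params, lr=lr)          # updates prompt parameters only
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for domain, loader in loaders.items():
            pk, pv = prompts[domain]               # this domain's prompt vectors
            for feats, labels in loader:
                logits = model(feats, pk, pv)      # prompts feed every attention module
                # Prediction loss against the text labeling sequence.
                loss = ce(logits.flatten(0, 1), labels.flatten())
                opt.zero_grad()
                loss.backward()
                opt.step()
    return model, prompts
```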
9. A speech recognition device, the device comprising:
a voice data acquisition unit, configured to acquire voice data to be recognized;
an acoustic feature sequence determining unit, configured to determine an acoustic feature sequence of the voice data to be recognized;
a prompt vector parameter obtaining unit, configured to obtain prompt vector parameters of the domain to which the voice data to be recognized belongs, wherein the prompt vector parameters are used for indicating voice recognition information specific to that domain, and the prompt vector parameters comprise key prompt vector parameters and value prompt vector parameters;
and a recognition unit, configured to input the prompt vector parameters and the acoustic feature sequence into a multi-domain voice recognition model, wherein the multi-domain voice recognition model performs encoding and decoding processing on the prompt vector parameters and the acoustic feature sequence to obtain a recognition result of the voice data, and the multi-domain voice recognition model and the prompt vector parameters are obtained by performing domain-adaptive training on an end-to-end voice recognition model; wherein, in the process of the multi-domain voice recognition model encoding and decoding the prompt vector parameters and the acoustic feature sequence to obtain the recognition result of the voice data, an attention mechanism module applies the prompt vector parameters to perform attention calculation.
10. A speech recognition device comprising a memory and a processor;
The memory is used for storing programs;
The processor is configured to execute the program to implement the respective steps of the speech recognition method according to any one of claims 1 to 8.
11. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 8.
CN202410034818.9A 2024-01-10 2024-01-10 Speech recognition method, device, equipment and readable storage medium Active CN117558263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410034818.9A CN117558263B (en) 2024-01-10 2024-01-10 Speech recognition method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410034818.9A CN117558263B (en) 2024-01-10 2024-01-10 Speech recognition method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN117558263A (en) 2024-02-13
CN117558263B (en) 2024-04-26

Family

ID=89823513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410034818.9A Active CN117558263B (en) 2024-01-10 2024-01-10 Speech recognition method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117558263B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436616A (en) * 2021-05-28 2021-09-24 中国科学院声学研究所 Multi-field self-adaptive end-to-end voice recognition method, system and electronic device
CN116129887A (en) * 2023-02-20 2023-05-16 南开大学 Speech recognition model construction method based on cross-domain alignment and domain distinction
JP2023075883A (en) * 2021-11-19 2023-05-31 日本放送協会 speech recognizer and program
CN116343755A (en) * 2023-03-15 2023-06-27 平安科技(深圳)有限公司 Domain-adaptive speech recognition method, device, computer equipment and storage medium
CN116364061A (en) * 2023-03-15 2023-06-30 平安科技(深圳)有限公司 Multi-scene voice recognition method, device, computer equipment and storage medium
CN116543768A (en) * 2023-05-31 2023-08-04 平安科技(深圳)有限公司 Model training method, voice recognition method and device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956959B (en) * 2019-11-25 2023-07-25 科大讯飞股份有限公司 Speech recognition error correction method, related device and readable storage medium


Also Published As

Publication number Publication date
CN117558263A (en) 2024-02-13

Similar Documents

Publication Publication Date Title
US11869530B2 (en) Generating audio using neural networks
CN109785824B (en) Training method and device of voice translation model
CN110648658B (en) Method and device for generating voice recognition model and electronic equipment
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
CN113781995B (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN111508470A (en) Training method and device of speech synthesis model
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN113409757A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
CN113822017A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
CN113053353B (en) Training method and device of speech synthesis model
CN113838468A (en) Streaming voice recognition method, terminal device and medium
CN117558263B (en) Speech recognition method, device, equipment and readable storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
US20240135955A1 (en) Generating audio using neural networks
CN117456999B (en) Audio identification method, audio identification device, vehicle, computer device, and medium
CN117059112A (en) Speech recognition method, device, equipment and readable storage medium
CN117012183A (en) Speech recognition model training method, speech recognition method and device
CN114360511A (en) Voice recognition and model training method and device
CN117059077A (en) Speech training method and computer readable storage medium
CN115602145A (en) Text-based speech generation
CN115273803A (en) Model training method and device, speech synthesis method, equipment and storage medium
CN115240633A (en) Method, apparatus, device and storage medium for text-to-speech conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant