CN117746864A - Speech recognition method, model training method, device, equipment and storage medium - Google Patents

Speech recognition method, model training method, device, equipment and storage medium

Info

Publication number: CN117746864A
Application number: CN202311871442.9A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: voice, grammar, decoding, classification, result
Inventors: 茆廷志, 万根顺, 高建清, 潘嘉, 刘聪, 奚昌凤, 王庆然
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd; priority to CN202311871442.9A

Landscapes

  • Machine Translation (AREA)
Abstract

The embodiments of the present application disclose a voice recognition method, a model training method, a device, equipment and a storage medium. Voice data is encoded to obtain encoding features of the voice data, and the encoding features are decoded to obtain decoding features; the decoding features are used for determining a voice recognition result and a grammar classification result of the voice data, and the decoding features are processed to obtain the voice recognition result. The decoding features obtained by decoding the encoding features can be used for both voice recognition and grammar classification; that is, grammar knowledge is taken into account in the process of encoding the voice data and in the process of decoding the encoding features, so that the accuracy of the voice recognition result is improved.

Description

Speech recognition method, model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method, a model training method, a device, equipment, and a storage medium.
Background
Speech recognition technology converts a speech signal into text. Although the accuracy and efficiency of current speech recognition systems have improved significantly, the accuracy of speech recognition still needs further improvement due to the complexity and diversity of speech signals.
Disclosure of Invention
In view of the foregoing, the present application provides a speech recognition method, a model training method, a device, an apparatus and a storage medium, so as to improve the accuracy of speech recognition.
In order to achieve the above object, the following solutions have been proposed:
a method of speech recognition, comprising:
encoding the voice data to obtain the encoding characteristics of the voice data;
decoding the coding features to obtain decoding features; the decoding characteristics are used for determining a voice recognition result and a grammar classification result of the voice data;
and processing the decoding characteristics to obtain the voice recognition result.
In the above method, optionally, the encoding of the voice data, the decoding of the encoding features and the processing of the decoding features are implemented through a voice recognition model;
the voice recognition model is obtained by jointly training a voice recognition task and a grammar classification task; the voice recognition task is realized through the voice recognition model.
The method, optionally, the process of jointly training the speech recognition task and the grammar classification task includes:
encoding the voice sample through the voice recognition model to obtain the encoding characteristic of the voice sample, decoding the encoding characteristic of the voice sample to obtain the decoding characteristic of the voice sample, and processing the decoding characteristic of the voice sample to obtain the voice recognition result of the voice sample;
processing at least the decoding characteristics of the voice sample through a grammar classification network to obtain a grammar classification result of the voice sample;
and updating the parameters of the voice recognition model and the parameters of the grammar classification network with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
The method, optionally, processes at least the decoded features of the speech samples through a grammar classification network, including:
mapping a voice recognition result of the voice sample into an embedded feature through an embedded model;
fusing the embedded features with the decoding features of the voice samples to obtain fusion features;
inputting the fusion features into a pre-trained grammar classification network to obtain grammar classification results of the voice samples output by the pre-trained grammar classification network.
In the above method, optionally, the updating of the parameters of the speech recognition model and the parameters of the pre-trained grammar classification network with the goal that the speech recognition result of the speech sample approaches the speech recognition label of the speech sample and the grammar classification result approaches the grammar label of the speech sample includes:
updating the parameters of the voice recognition model, the parameters of the pre-trained grammar classification network and the parameters of the embedded model with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
The method, optionally, processes at least the decoded features of the speech samples through a grammar classification network, including:
performing linear processing on the decoding characteristics of the voice sample through a linear processing module of the grammar classification network to obtain a linear processing result of the decoding characteristics of the voice sample;
and classifying the linear processing result through a classifying module of the grammar classifying network to obtain a grammar classifying result of the voice sample output by the grammar classifying network.
In the above method, optionally, the process of performing linear processing on the decoding features of the speech sample through the linear processing module of the grammar classification network and performing classification processing on the linear processing result through the classification module of the grammar classification network includes:
performing first linear processing on the decoding characteristics of the voice sample through a first linear processing module to obtain a first linear processing result; performing first classification processing on the first linear processing result through a first classification module to obtain a part-of-speech classification result of the voice sample;
and/or,
performing second linear processing on the decoding characteristics of the voice sample through a second linear processing module to obtain a second linear processing result; and performing second classification processing on the second linear processing result through a second classification module to obtain a dependency syntax classification result of the voice sample.
A speech recognition model training method, comprising:
encoding the voice sample through the voice recognition model to obtain the encoding characteristic of the voice sample, decoding the encoding characteristic of the voice sample to obtain the decoding characteristic of the voice sample, and processing the decoding characteristic of the voice sample to obtain the voice recognition result of the voice sample;
processing at least the decoding characteristics of the voice sample through a grammar classification network to obtain a grammar classification result of the voice sample;
and updating the parameters of the voice recognition model and the parameters of the grammar classification network with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
A speech recognition apparatus comprising:
the coding module is used for coding the voice data to obtain coding characteristics of the voice data;
the decoding module is used for decoding the coding features to obtain decoding features; the decoding characteristics are used for determining a voice recognition result and a grammar classification result of the voice data;
and the processing module is used for processing the decoding characteristics to obtain the voice recognition result.
A speech recognition model training apparatus comprising:
the recognition module is used for encoding the voice sample through the voice recognition model to obtain the encoding characteristic of the voice sample, decoding the encoding characteristic of the voice sample to obtain the decoding characteristic of the voice sample, and processing the decoding characteristic of the voice sample to obtain the voice recognition result of the voice sample;
the classification module is used for processing at least the decoding characteristics of the voice sample through a grammar classification network to obtain a grammar classification result of the voice sample;
and the updating module is used for updating the parameters of the voice recognition model and the parameters of the grammar classification network with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
A speech processing apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the speech recognition method according to any one of the above, and/or to implement the steps of the speech recognition model training method as described above.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method according to any one of the preceding claims and/or the steps of the speech recognition model training method as described above.
From the above technical solutions, it can be seen that the voice recognition method, the model training method, the device, the equipment and the storage medium provided in the embodiments of the present application encode voice data to obtain encoding features of the voice data, and decode the encoding features to obtain decoding features; the decoding features are used for determining a voice recognition result and a grammar classification result of the voice data, and the decoding features are processed to obtain the voice recognition result. The decoding features obtained by decoding the encoding features can be used for both voice recognition and grammar classification; that is, grammar knowledge is taken into account in the process of encoding the voice data and in the process of decoding the encoding features, so that the accuracy of the voice recognition result is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flowchart of one implementation of a speech recognition method disclosed in an embodiment of the present application;
FIG. 2 is a flow chart of one implementation of the joint training of speech recognition tasks and grammar classification tasks disclosed in embodiments of the present application;
FIG. 3a is a flow chart of one implementation of processing at least decoded features of a speech sample through a grammar classification network as disclosed in an embodiment of the present application;
FIG. 3b is a diagram of a system architecture for joint training of speech recognition tasks and grammar classification tasks as disclosed in embodiments of the present application;
FIG. 3c is a diagram of another system architecture for joint training of speech recognition tasks and grammar classification tasks as disclosed in embodiments of the present application;
FIG. 4a is a flowchart of another implementation of processing at least decoded features of a speech sample through a grammar classification network as disclosed in an embodiment of the present application;
FIG. 4b is a diagram of yet another system architecture for joint training of speech recognition tasks and grammar classification tasks as disclosed in embodiments of the present application;
FIG. 4c is a diagram of yet another system architecture for joint training of speech recognition tasks and grammar classification tasks as disclosed in embodiments of the present application;
FIG. 4d is a diagram of yet another system architecture for joint training of speech recognition tasks and grammar classification tasks as disclosed in embodiments of the present application;
FIG. 5 is a schematic diagram of a voice recognition device according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a hardware structure of a speech processing device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
With the development of deep learning technology, the accuracy of speech recognition systems has improved remarkably, and more and more speech recognition systems support multiple languages, such as Chinese, English, German, Spanish and Arabic. Speech recognition technology therefore has broad application prospects in international and cross-cultural communication. Besides traditional fields such as smart home, intelligent customer service and intelligent healthcare, speech recognition technology is beginning to be applied in more fields, such as intelligent driving, intelligent education and intelligent finance. In short, speech recognition technology has improved and expanded significantly in accuracy, speed, range of application and other respects. As the technology develops and application scenarios expand, speech recognition will be widely applied in still more fields and bring people a more convenient and efficient living experience.
However, although the accuracy and efficiency of current speech recognition systems have improved significantly, speech recognition technology still has certain limitations due to the complexity and diversity of speech signals. For example, different speakers may have different accents, speaking rates and intonation, all of which affect the accuracy of speech recognition and make the recognition results subjectively unfriendly to users. In addition, in some scenarios, although the overall recognition accuracy of speech recognition exceeds 90%, many problems remain in the recognition results; for example, the accuracy of whole-sentence recognition may be high, yet the misrecognition of a few words can still cause users to rate the recognition results poorly, because even a single misrecognized word may affect the meaning and expressive effect of the sentence. This can also create risks in situations that demand high accuracy, such as the medical and legal fields.
In order to improve the accuracy of voice recognition, the scheme of the application is provided.
As shown in fig. 1, a flowchart for implementing a voice recognition method according to an embodiment of the present application may include:
Step S101: and encoding the voice data to obtain the encoding characteristics of the voice data.
The coding feature is a hidden layer feature of the speech data.
Step S102: decoding the coding feature to obtain a decoding feature; the decoding features are used to determine speech recognition results and grammar classification results for the speech data.
The decoding features are also hidden-layer features, and can be used for grammar classification in addition to speech recognition. That is, the present application takes grammar knowledge into account both in the process of encoding the speech data and in the process of decoding the encoding features.
Step S103: and processing the decoding characteristics to obtain a voice recognition result.
Since the decoding features can be used for grammar classification of the voice data, the voice recognition result obtained by processing the decoding features is more consistent with grammar knowledge.
According to the above voice recognition method, grammar knowledge is taken into account in the process of encoding the voice data and in the process of decoding the encoding features, so that the voice recognition result is more consistent with grammar knowledge, improving the accuracy of the voice recognition result.
In an alternative embodiment, the above-mentioned process of encoding voice data, decoding the encoded features, and processing the decoded features may be implemented by a voice recognition model.
Optionally, acoustic feature extraction may be performed on the voice data to obtain acoustic features of each voice frame of the voice data.
And inputting the acoustic characteristics of each voice frame into a voice recognition model to obtain a voice recognition result output by the voice recognition model. Wherein,
the voice recognition model is used for encoding the acoustic features of each voice frame to obtain the encoding features of each voice frame, decoding the encoding features of each voice frame to obtain the decoding features of the voice data, and processing the decoding features to obtain the voice recognition result.
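As a concrete illustration of the acoustic front end, the following is a minimal sketch of extracting per-frame acoustic features with torchaudio. The 80-dimensional filterbank features, mono audio and the file name are illustrative assumptions; the application does not fix a particular acoustic feature type.

```python
import torchaudio

# Load an utterance (mono audio assumed) and extract per-frame filterbank
# features; 80 mel bins is a common but assumed choice.
waveform, sample_rate = torchaudio.load("utterance.wav")   # (channels, num_samples)
acoustic_feats = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,
    sample_frequency=sample_rate,
)  # (num_frames, 80): one acoustic feature vector per speech frame
```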
Optionally, the speech recognition model may include an encoding module, a decoding module, and a recognition module; wherein,
the encoding module encoder is used for encoding voice data (specifically, acoustic features of each voice frame) to obtain encoding features of the voice data (namely, encoding features of each voice frame).
The decoding module decoder is used for decoding the coding features of the voice data to obtain decoding features of the voice data.
And the recognition module is used for processing the decoding characteristics of the voice data to obtain a voice recognition result. Alternatively, the identification module may include: a linear layer for performing linear transformation on the decoding characteristics of the voice data; and the activation layer is used for activating the linear transformation result output by the linear layer to obtain a voice recognition result. As an example, the Linear layer may perform Linear transformation on the decoded feature of the voice data using a Linear function, and the active layer may use a GELU (Gaussian Error Linear Unit) active function, or a Softmax active function, or a ReLU (Rectified Linear Units) active function, etc., which is not particularly limited in the embodiment of the present invention.
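To make the encoding module / decoding module / recognition module layout concrete, here is a minimal PyTorch sketch. The Transformer layers, layer counts, dimensions and vocabulary size are all assumptions for illustration; the application does not prescribe the internal structure of the modules.

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    """Minimal sketch of the encoder/decoder/recognition-module layout
    described above; sizes and the use of Transformer layers are assumptions."""

    def __init__(self, feat_dim=80, d_model=256, nhead=4, vocab_size=5000):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        # Encoding module: acoustic features -> encoding (hidden-layer) features.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=6,
        )
        # Decoding module: encoding features -> decoding features. Sketched here
        # as further self-attention layers; the application leaves its internals open.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=2,
        )
        # Recognition module: a linear layer; an activation (e.g. Softmax)
        # over its output yields the recognition result.
        self.linear = nn.Linear(d_model, vocab_size)

    def forward(self, feats):                        # feats: (B, T, feat_dim)
        enc_feats = self.encoder(self.input_proj(feats))
        dec_feats = self.decoder(enc_feats)
        logits = self.linear(dec_feats)              # softmax over logits gives the result
        return logits, dec_feats, enc_feats
```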
Optionally, the speech recognition model is obtained by jointly training a speech recognition task and a grammar classification task. The voice recognition task is realized through a voice recognition model, and the grammar classification task can be realized through a grammar classification network.
After the combined training of the voice recognition task and the grammar classification task is completed, only a voice recognition model is needed when the voice recognition is carried out, and the participation of a grammar classification network is not needed.
In an alternative embodiment, a flowchart for implementing the above-mentioned joint training of the speech recognition task and the grammar classification task is shown in fig. 2, and may include:
step S201: performing voice recognition on the voice sample through the voice recognition model, specifically comprising: and encoding the voice sample through the voice recognition model to obtain the encoding characteristic of the voice sample, decoding the encoding characteristic of the voice sample to obtain the decoding characteristic of the voice sample, and processing the decoding characteristic of the voice sample to obtain the voice recognition result of the voice sample.
Step S202: and processing at least the decoding characteristics of the voice sample through a grammar classification network to obtain a grammar classification result of the voice sample.
In this embodiment of the present application, the grammar classification network may perform grammar classification on the voice sample only based on the decoding features of the voice sample, or may perform grammar classification on the voice sample based on the decoding features of the voice sample and the voice recognition result.
The grammar classification herein may include, but is not limited to, at least one of the following: part of speech classification, dependency syntax classification, etc.
The part of speech of a word describes the grammatical role that the word plays in a sentence. For example, in the sentence "I am watching TV", "I" is a pronoun, "am" is an auxiliary, "watching" is a verb, and "TV" is a noun. By analyzing the parts of speech of the words in sentences during joint training, the trained voice recognition model can better understand the relations among words, improving the accuracy of voice recognition.
Dependency syntax describes the dependency relations between the words in a sentence. Taking the sentence "I am watching TV" as an example again, there is a dependency relation between "watching" and "TV", indicating that "TV" is the object of "watching". By performing dependency syntax analysis on sentences during joint training, the trained voice recognition model can better understand the relations among words, thereby improving the accuracy and consistency of voice recognition.
Step S203: parameter updating is carried out based on a voice recognition result and a grammar classification result of a voice sample, and the method comprises the following steps: and updating parameters of the voice recognition model and parameters of the grammar classification network by taking the voice recognition result of the voice sample approaching to the voice recognition label of the voice sample and the grammar classification result approaching to the grammar label of the voice sample as targets.
As an example, a first difference between the voice recognition result and the voice recognition label may be calculated through a first loss function, a second difference between the grammar classification result and the grammar label may be calculated through a second loss function, the first difference and the second difference may be fused to obtain a comprehensive difference, and the parameters of the voice recognition model and the parameters of the grammar classification network may be updated with the goal of making the comprehensive difference smaller and smaller.
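The following sketch shows one joint-training step under this scheme. It assumes the hypothetical SpeechRecognitionModel from the earlier sketch, a grammar network that classifies frame-level decoding features, per-step recognition targets aligned with the decoding features, cross-entropy for both loss functions, and a fusion weight theta=0.3; none of these values comes from the application.

```python
import torch.nn.functional as F

def joint_training_step(model, grammar_net, optimizer,
                        feats, asr_labels, grammar_labels, theta=0.3):
    logits, dec_feats, _ = model(feats)
    # First difference: recognition result vs. recognition label.
    loss_asr = F.cross_entropy(logits.transpose(1, 2), asr_labels)
    # Second difference: grammar classification result vs. grammar label.
    grammar_logits = grammar_net(dec_feats)
    loss_grammar = F.cross_entropy(grammar_logits.transpose(1, 2), grammar_labels)
    # Fuse the two differences into the comprehensive difference and update
    # both the recognition model and the grammar classification network.
    loss = loss_asr + theta * loss_grammar
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```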
By jointly training the voice recognition model and the grammar classification network in this way, the training process of voice recognition is constrained by grammar knowledge.
In an alternative embodiment, a flowchart of an implementation of processing at least the decoded features of the speech samples through the grammar classification network as shown in fig. 3a may include:
step S301: the speech recognition result of the speech sample is mapped to an embedding feature (embedding) by an embedding model.
In the present application, the embedded features of the voice recognition result are extracted through the embedded model.
Step S302: and fusing the embedded features with the decoding features of the voice samples to obtain fusion features.
The embedded features may be spliced with the decoding features to obtain the fusion features; or, the embedded features may be added to the decoding features to obtain the fusion features.
Step S303: inputting the fusion features into a pre-trained grammar classification network to obtain grammar classification results of voice samples output by the pre-trained grammar classification network.
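A hedged sketch of steps S301 to S303 follows. The vocabulary size, the embedding dimension matching the decoding-feature dimension, and the assumption that the grammar network's input dimension matches whichever fusion variant is chosen are all illustrative.

```python
import torch
import torch.nn as nn

# Embedded model: maps recognition-result token ids to embedding features.
embedding_model = nn.Embedding(num_embeddings=5000, embedding_dim=256)

def classify_grammar(token_ids, dec_feats, grammar_net, fuse="add"):
    emb = embedding_model(token_ids)                 # (B, T, 256) embedding features
    if fuse == "add":
        fused = emb + dec_feats                      # element-wise addition
    else:
        # Splicing doubles the feature dimension; grammar_net must accept it.
        fused = torch.cat([emb, dec_feats], dim=-1)
    return grammar_net(fused)                        # grammar classification result
```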
In the present application, the grammar classification network is pre-trained through text and grammar labels before being jointly trained with the speech recognition model. Optionally, the process of pre-training the grammar classification network may include:
inputting the text into a grammar classifying network to obtain grammar classifying results output by the grammar classifying network.
and updating the parameters of the grammar classification network with the goal that the grammar classification result output by the grammar classification network approaches the grammar label corresponding to the text.
Grammar tags may include, but are not limited to, at least one of the following: part of speech tags, dependency syntax tags.
As an example, the grammar classification network may be a Transformer encoder.
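Below is a minimal sketch of such a Transformer-encoder grammar classifier together with one pre-training step; the depth, dimensions and tag inventory size are assumptions, and the text is assumed to arrive as embedded features.

```python
import torch.nn as nn
import torch.nn.functional as F

class GrammarClassificationNetwork(nn.Module):
    """Sketch of a Transformer-encoder grammar classifier pre-trained on
    text/grammar-label pairs; all sizes are assumptions."""

    def __init__(self, d_model=256, nhead=4, num_tags=42):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=3,
        )
        self.tag_head = nn.Linear(d_model, num_tags)

    def forward(self, x):                    # x: (B, T, d_model) text features
        return self.tag_head(self.encoder(x))

def pretrain_step(net, optimizer, text_feats, grammar_tags):
    # Goal: the grammar classification result approaches the grammar label of the text.
    loss = F.cross_entropy(net(text_feats).transpose(1, 2), grammar_tags)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```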
Correspondingly, updating the parameters of the voice recognition model and the parameters of the pre-trained grammar classification network with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample may be as follows:
updating the parameters of the voice recognition model, the parameters of the pre-trained grammar classification network and the parameters of the embedded model with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
As shown in fig. 3b, a system architecture diagram for joint training of speech recognition tasks and grammar classification tasks is provided in an embodiment of the present application.
In this example, acoustic feature extraction is performed on the voice data to obtain acoustic features, the acoustic features are input into the voice recognition model, the acoustic features are encoded by the encoding module of the voice recognition model to obtain encoding features, the encoding features are decoded by the decoding module to obtain decoding features, the decoding features are recognized by the recognition module to obtain a voice recognition result, and a first loss between the voice recognition result and the voice recognition label (denoted as Loss_asr) is calculated. The first loss may be a cross-entropy loss.
Embedded features are extracted by the embedded model from the final transcribed text determined based on the voice recognition result, the embedded features are fused with the decoding features output by the decoding module to obtain fusion features, grammar classification is performed on the fusion features by the pre-trained grammar classification network to obtain a grammar classification result, and a second loss between the grammar classification result and the grammar label (denoted as Loss_t2g) is calculated. The second loss may be a cross-entropy loss.
The first loss and the second loss are weighted and summed to obtain the comprehensive loss of the joint training, which can be formulated as:

Loss_final = Loss_asr + θ·Loss_t2g (1)

where Loss_final is the comprehensive loss of the joint training, and θ is the weight of the second loss, which may be a positive number less than 1.
The parameters of the voice recognition model, the parameters of the pre-trained grammar classification network and the parameters of the embedded model are updated with the goal of making the comprehensive loss of the joint training smaller and smaller.
The weights of the first loss and the second loss may be the same or different. As an example, the weight of the first loss is greater than the weight of the second loss.
As shown in fig. 3c, another system architecture diagram for joint training of speech recognition tasks and grammar classification tasks is provided in an embodiment of the present application.
The system architecture shown in fig. 3c differs from the system architecture shown in fig. 3b in that the speech recognition model has two speech recognition branches. One branch is identical to the speech recognition path shown in fig. 3b, and a third loss between its speech recognition result and the speech recognition label (denoted as Loss_1) is calculated; the third loss may be a cross-entropy loss. In the other speech recognition branch, a second recognition module performs recognition processing on the encoding features output by the encoding module to obtain a speech recognition result; correspondingly, when calculating the speech recognition loss, a fourth loss between the speech recognition result of the added branch and the speech recognition label (denoted as Loss_2) is calculated. The fourth loss may be a CTC loss.
The total loss of the speech recognition model is a weighted sum of the third loss Loss_1 and the fourth loss Loss_2, which can be formulated as:

Loss_asr_total = λ·Loss_1 + (1−λ)·Loss_2 (2)

where Loss_asr_total is the total loss of the speech recognition model, and λ is the weight of the third loss.

The comprehensive loss of the joint training is a weighted sum of the total loss of the speech recognition model and the second loss Loss_t2g, which can be formulated as:

Loss_final = Loss_asr_total + θ·Loss_t2g (3)

where Loss_final is the comprehensive loss of the joint training, and θ is the weight of the second loss, which may be a positive number less than 1.
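A compact sketch of formulas (2) and (3) follows, assuming per-step cross-entropy targets for the decoder-side branch and CTC label sequences for the encoder-side branch; the weights lam=0.7 and theta=0.3 are illustrative, not values from the application.

```python
import torch.nn.functional as F

def combined_loss(dec_logits, enc_logits, dec_labels, ctc_labels,
                  enc_lens, label_lens, loss_t2g, lam=0.7, theta=0.3):
    # Third loss Loss_1: cross entropy on the decoder-side recognition branch.
    loss_1 = F.cross_entropy(dec_logits.transpose(1, 2), dec_labels)
    # Fourth loss Loss_2: CTC on the encoder-side branch; CTC wants (T, B, C).
    log_probs = enc_logits.log_softmax(dim=-1).transpose(0, 1)
    loss_2 = F.ctc_loss(log_probs, ctc_labels, enc_lens, label_lens)
    loss_asr_total = lam * loss_1 + (1 - lam) * loss_2       # formula (2)
    return loss_asr_total + theta * loss_t2g                 # formula (3)
```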
In an alternative embodiment, another implementation flowchart for processing at least the decoded features of the speech samples through the grammar classification network as shown in fig. 4a may include:
step S401: and carrying out linear processing on the decoding characteristics of the voice samples by a linear processing module of the grammar classification network to obtain a linear processing result of the decoding characteristics of the voice samples.
The Linear processing module may use a Linear function to linearly process the decoded features of the speech samples.
Step S402: and classifying the linear processing result through a classifying module of the grammar classifying network to obtain a grammar classifying result of the voice sample obtained by the grammar classifying network.
In the embodiment of the application, the grammar classification network shares the coding parameters and decoding parameters of the voice recognition model, and pre-training of the grammar classification network is not needed.
In an optional embodiment, the above-mentioned linear processing module of the grammar classification network performs linear processing on the decoded feature of the speech sample, and one implementation manner of performing the classification processing on the linear processing result by the classification module of the grammar classification network may be:
and performing first linear processing on the decoding characteristics of the voice sample through a first linear processing module to obtain a first linear processing result. As an example, the first Linear processing module may perform a first Linear processing on the decoded features of the speech samples using a Linear function.
And performing first classification processing on the first linear processing result through a first classification module to obtain a part-of-speech classification result of the voice sample. As an example, the first classification module may perform a first classification process (i.e., an activation process) on the first linear processing result using an activation function. The activation function may include, but is not limited to, any of the following: a GELU activation function, a Softmax activation function, a ReLU activation function, etc.
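A minimal sketch of this part-of-speech branch follows; the 256-dimensional decoding features and the 42-tag part-of-speech inventory are assumptions.

```python
import torch
import torch.nn as nn

first_linear = nn.Linear(256, 42)                  # first linear processing module

def pos_classify(dec_feats):                       # dec_feats: (B, T, 256)
    linear_result = first_linear(dec_feats)        # first linear processing result
    # First classification module: Softmax activation over the tag dimension.
    return torch.softmax(linear_result, dim=-1)    # part-of-speech classification result
```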
As shown in fig. 4b, another system architecture diagram for joint training of speech recognition tasks and grammar classification tasks is provided in an embodiment of the present application. Unlike the example shown in fig. 3b, in this example the grammar classification task shares the parameters of the encoding module and the decoding module with the speech recognition task. The loss of the speech recognition model, Loss_asr, is calculated as in the example shown in fig. 3b, which is not repeated here. The loss of the grammar classification network is a fifth loss between the part-of-speech classification result output by the first classification module and the part-of-speech label (denoted as Loss_pos). On this basis, the comprehensive loss of the joint training is a weighted sum of the first loss Loss_asr and the fifth loss Loss_pos. The weights of the first loss and the fifth loss may be the same or different. The fifth loss may be a cross-entropy loss.
Of course, in fig. 4b, the structure of the speech recognition model may be that of the speech recognition model in fig. 3c, in which case the total loss of the speech recognition model, Loss_asr_total, is calculated with reference to formula (2), which is not described in detail here.
In an optional embodiment, the above-mentioned linear processing module of the grammar classification network performs linear processing on the decoded feature of the speech sample, and another implementation manner of performing the classification processing on the linear processing result by the classification module of the grammar classification network may be:
and performing second linear processing on the decoding characteristics of the voice sample through a second linear processing module to obtain a second linear processing result. As an example, the second Linear processing module may perform a second Linear processing on the decoded features of the speech samples using a Linear function.
And performing second classification processing on the second linear processing result through a second classification module to obtain a dependency syntax classification result of the voice sample.
The composition of the dependency syntax mainly includes two parts, namely an edge (denoted as arc) and a head (denoted as head), based on which the second classification module may include a first sub-classification module and a second sub-classification module; wherein,
the first sub-classification module is used for performing first sub-classification processing on the second linear processing result to obtain an edge classification result of the dependency syntax. Optionally, the first sub-classification module may include: a first multi-layer perceptron, a first bilinear layer connected with the first multi-layer perceptron, and a first activation layer connected with the first bilinear layer. The first multi-layer perceptron is used to model the edges of the dependency syntax, and the first bilinear layer is used to implement interactions.
The second sub-classification module is used for performing second sub-classification processing on the second linear processing result to obtain a head classification result of the dependency syntax. Optionally, the second sub-classification module may include: a second multi-layer perceptron, a second bilinear layer connected with the second multi-layer perceptron, and a second activation layer connected with the second bilinear layer. The second multi-layer perceptron is used to model the heads of the dependency syntax, and the second bilinear layer is used to implement interactions.
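The sketch below illustrates this branch: each sub-classification module is rendered as an MLP followed by a bilinear pairwise interaction and a softmax activation, loosely in the spirit of a biaffine parser. All dimensions, the random bilinear-weight initialization and the exact token-pair scoring layout are assumptions.

```python
import torch
import torch.nn as nn

class DependencySyntaxHead(nn.Module):
    """Hedged sketch of the dependency-syntax branch: second linear processing,
    then two sub-classification modules (edge/arc and head)."""

    def __init__(self, d_model=256, d_mlp=128):
        super().__init__()
        self.second_linear = nn.Linear(d_model, d_model)    # second linear processing module
        self.arc_mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.ReLU())   # MLP1: edges
        self.head_mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.ReLU())  # MLP2: heads
        self.arc_w = nn.Parameter(torch.randn(d_mlp, d_mlp))    # Bilinear1: edge interactions
        self.head_w = nn.Parameter(torch.randn(d_mlp, d_mlp))   # Bilinear2: head interactions

    def forward(self, dec_feats):                    # dec_feats: (B, T, d_model)
        x = self.second_linear(dec_feats)            # second linear processing result
        a = self.arc_mlp(x)                          # (B, T, d_mlp)
        h = self.head_mlp(x)                         # (B, T, d_mlp)
        # The bilinear form scores every token pair; softmax is the activation layer.
        arc_scores = torch.einsum("bid,de,bje->bij", a, self.arc_w, a)
        head_scores = torch.einsum("bid,de,bje->bij", h, self.head_w, h)
        return arc_scores.softmax(dim=-1), head_scores.softmax(dim=-1)
```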
As shown in fig. 4c, another system architecture diagram for joint training of speech recognition tasks and grammar classification tasks is provided in an embodiment of the present application. In this example, the first sub-classification module includes a first multi-layer perceptron MLP1, a first bilinear layer Bilinear1 and a first activation layer, and the second sub-classification module includes a second multi-layer perceptron MLP2, a second bilinear layer Bilinear2 and a second activation layer.
The total loss of the grammar classification network is obtained by weighted summation of a sixth loss calculated based on the edge classification result and the edge label (denoted as Loss_arc) and a seventh loss calculated based on the head classification result and the head label (denoted as Loss_head). The sum of the weights of the sixth loss and the seventh loss is 1. The total loss of the grammar classification network can be formulated as:

Loss_dp = γ·Loss_arc + (1−γ)·Loss_head (4)

where Loss_dp is the total loss of the grammar classification network, and γ is the weight of the sixth loss.
The sixth loss and the seventh loss may both be cross entropy losses.
The total penalty for joint training is then a weighted sum of the penalty for the speech recognition task (see previous embodiments) and the total penalty for the grammar classification network.
Of course, in fig. 4c, the structure of the speech recognition model may be that of the speech recognition model in fig. 3c, in which case the total loss of the speech recognition model, Loss_asr_total, is calculated with reference to formula (2), which is not described in detail here.
In an optional embodiment, the above-mentioned linear processing module of the grammar classification network performs linear processing on the decoded feature of the speech sample, and another implementation manner of performing the classification processing on the linear processing result by the classification module of the grammar classification network may be:
and performing first linear processing on the decoding characteristics of the voice sample through a first linear processing module to obtain a first linear processing result. And performing first classification processing on the first linear processing result through a first classification module to obtain a part-of-speech classification result of the voice sample.
And performing second linear processing on the decoding characteristics of the voice sample through a second linear processing module to obtain a second linear processing result. And performing second classification processing on the second linear processing result through a second classification module to obtain a dependency syntax classification result of the voice sample.
As shown in fig. 4d, another system architecture diagram for joint training of speech recognition tasks and grammar classification tasks is provided in an embodiment of the present application. The grammar classification network in this example performs both part-of-speech classification and dependency syntax classification. The loss of the speech recognition task, Loss_asr, the loss of the part-of-speech classification task, Loss_pos, and the loss of the dependency syntax task, Loss_dp, are calculated as in the foregoing embodiments and are not repeated here; what is mainly described here is the calculation of the comprehensive loss of the joint training. The loss of the part-of-speech classification task and the loss of the dependency syntax task may be weighted and summed to obtain the total loss of the grammar classification task, which is then summed with the loss of the speech recognition task, Loss_asr, to obtain the comprehensive loss of the joint training. This can be formulated as:

Loss_final = Loss_asr + δ·Loss_pos + (1−δ)·Loss_dp (5)

where Loss_final is the comprehensive loss of the joint training, and δ is the weight of the part-of-speech classification loss.
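Formula (5) reduces to a one-line combination; in this sketch the weight delta=0.5 is an assumption rather than a value from the application.

```python
# Comprehensive loss of the joint training per formula (5); delta is assumed.
def final_loss(loss_asr, loss_pos, loss_dp, delta=0.5):
    return loss_asr + delta * loss_pos + (1 - delta) * loss_dp
```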
Corresponding to the method embodiment, the present application further provides a voice recognition device, and a schematic structural diagram of the voice recognition device provided in the embodiment of the present application is shown in fig. 5, which may include:
an encoding module 501, a decoding module 502 and a processing module 503; wherein,
the encoding module 501 is configured to encode the voice data to obtain an encoding feature of the voice data;
the decoding module 502 is configured to decode the encoded feature to obtain a decoded feature; the decoding characteristics are used for determining a voice recognition result and a grammar classification result of the voice data;
the processing module 503 is configured to process the decoded feature to obtain the speech recognition result.
According to the voice recognition device provided by the embodiment of the application, the decoding features obtained by decoding the encoding features can be used for both voice recognition and grammar classification; that is, grammar knowledge is taken into account in the process of encoding the voice data and in the process of decoding the encoding features, so that the voice recognition result is more consistent with grammar knowledge, improving the accuracy of the voice recognition result.
In an alternative embodiment, the speech recognition device comprises a speech recognition model, and the encoding module 501, the decoding module 502 and the processing module 503 belong to the speech recognition model. The voice recognition model is obtained by jointly training a voice recognition task and a grammar classification task; the voice recognition task is realized through the voice recognition model.
In an alternative embodiment, the voice recognition device further comprises a training module for:
encoding the voice sample through the voice recognition model to obtain the encoding characteristic of the voice sample, decoding the encoding characteristic of the voice sample to obtain the decoding characteristic of the voice sample, and processing the decoding characteristic of the voice sample to obtain the voice recognition result of the voice sample;
processing at least the decoding characteristics of the voice sample through a grammar classification network to obtain a grammar classification result of the voice sample;
and updating the parameters of the voice recognition model and the parameters of the grammar classification network with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
In an alternative embodiment, the training module is configured to, when processing at least the decoded features of the speech samples through a grammar classification network:
mapping a voice recognition result of the voice sample into an embedded feature through an embedded model;
fusing the embedded features with the decoding features of the voice samples to obtain fusion features;
inputting the fusion features into a pre-trained grammar classification network to obtain grammar classification results of the voice samples output by the pre-trained grammar classification network.
In an alternative embodiment, when updating the parameters of the speech recognition model and the parameters of the pre-trained grammar classification network with the goal that the speech recognition result of the speech sample approaches the speech recognition label of the speech sample and the grammar classification result approaches the grammar label of the speech sample, the training module is configured to:
update the parameters of the voice recognition model, the parameters of the pre-trained grammar classification network and the parameters of the embedded model with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
In an alternative embodiment, the training module is configured to, when processing at least the decoded features of the speech samples through a grammar classification network:
performing linear processing on the decoding characteristics of the voice sample through a linear processing module of the grammar classification network to obtain a linear processing result of the decoding characteristics of the voice sample;
and classifying the linear processing result through a classifying module of the grammar classifying network to obtain a grammar classifying result of the voice sample output by the grammar classifying network.
In an alternative embodiment, when performing linear processing on the decoding features of the speech sample through the linear processing module of the grammar classification network and performing classification processing on the linear processing result through the classification module of the grammar classification network, the training module is configured to:
perform first linear processing on the decoding characteristics of the voice sample through a first linear processing module to obtain a first linear processing result, and perform first classification processing on the first linear processing result through a first classification module to obtain a part-of-speech classification result of the voice sample;
and/or,
performing second linear processing on the decoding characteristics of the voice sample through a second linear processing module to obtain a second linear processing result; and performing second classification processing on the second linear processing result through a second classification module to obtain a dependency syntax classification result of the voice sample.
Corresponding to the method embodiment, the embodiment of the application also provides a voice recognition model training device, and the voice recognition model training device provided by the embodiment of the application may include:
the recognition module is used for encoding the voice sample through the voice recognition model to obtain the encoding characteristic of the voice sample, decoding the encoding characteristic of the voice sample to obtain the decoding characteristic of the voice sample, and processing the decoding characteristic of the voice sample to obtain the voice recognition result of the voice sample;
the classification module is used for processing at least the decoding characteristics of the voice sample through a grammar classification network to obtain a grammar classification result of the voice sample;
and the updating module is used for updating the parameters of the voice recognition model and the parameters of the grammar classification network with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
Based on the speech recognition model trained by the above speech recognition model training device, the decoding features obtained by decoding the encoding features can be used for both speech recognition and grammar classification. That is, a speech recognition model obtained with the training scheme of the present application takes grammar knowledge into account when encoding speech data and when decoding the encoding features, so that the speech recognition result is more consistent with grammar knowledge and the accuracy of the speech recognition model is improved.
The voice recognition device and/or the voice recognition model training device provided by the embodiment of the application can be applied to voice processing equipment, such as a PC terminal, a robot, a cloud platform, a server cluster and the like. Alternatively, fig. 6 shows a block diagram of a hardware structure of a voice processing apparatus, and referring to fig. 6, the hardware structure of the voice processing apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (non-volatile memory) or the like, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor may call the program stored in the memory,
the program is for: encoding the voice data to obtain the encoding characteristics of the voice data; decoding the coding features to obtain decoding features; the decoding characteristics are used for determining a voice recognition result and a grammar classification result of the voice data; and processing the decoding characteristics to obtain the voice recognition result.
And/or,
the program is used for encoding the voice sample through the voice recognition model to obtain encoding features of the voice sample, decoding the encoding features of the voice sample to obtain decoding features of the voice sample, and processing the decoding features of the voice sample to obtain a voice recognition result of the voice sample; processing at least the decoding features of the voice sample through a grammar classification network to obtain a grammar classification result of the voice sample; and updating the parameters of the voice recognition model and the parameters of the grammar classification network with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The present embodiment also provides a storage medium, which may store a program adapted to be executed by a processor,
the program is for: encoding the voice data to obtain the encoding characteristics of the voice data; decoding the coding features to obtain decoding features; the decoding characteristics are used for determining a voice recognition result and a grammar classification result of the voice data; and processing the decoding characteristics to obtain the voice recognition result.
And/or,
the program is used for encoding the voice sample through the voice recognition model to obtain encoding features of the voice sample, decoding the encoding features of the voice sample to obtain decoding features of the voice sample, and processing the decoding features of the voice sample to obtain a voice recognition result of the voice sample; processing at least the decoding features of the voice sample through a grammar classification network to obtain a grammar classification result of the voice sample; and updating the parameters of the voice recognition model and the parameters of the grammar classification network with the goal that the voice recognition result of the voice sample approaches the voice recognition label of the voice sample and the grammar classification result approaches the grammar label of the voice sample.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied, in essence or in the part contributing to the prior art or in part, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of speech recognition, comprising:
encoding speech data to obtain encoding features of the speech data;
decoding the encoding features to obtain decoding features, wherein the decoding features are used for determining a speech recognition result and a grammar classification result of the speech data; and
processing the decoding features to obtain the speech recognition result.
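Read as an encoder-decoder pipeline, claim 1 admits a minimal sketch along the following lines. The Transformer architecture, every module size, and every name in the code are illustrative assumptions rather than features recited in the claim; the claim fixes only the encode-decode-process order and the dual use of the decoding features.

```python
# A minimal sketch of claim 1: encode speech -> decode -> project to the
# vocabulary. The Transformer choice, all dimensions, and the class name
# are assumptions; only the three-step structure comes from the claim.
import torch
import torch.nn as nn

class SketchASR(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, vocab_size=5000):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, prev_tokens):
        # Encoding step: speech features -> encoding features.
        enc_feats = self.encoder(self.input_proj(speech_feats))
        # Decoding step: encoding features -> decoding features. These same
        # decoding features would also feed the grammar classifier.
        dec_feats = self.decoder(self.token_emb(prev_tokens), enc_feats)
        # Processing step: decoding features -> recognition logits.
        return dec_feats, self.output_proj(dec_feats)

model = SketchASR()
feats = torch.randn(1, 120, 80)              # (batch, frames, filterbank bins)
tokens = torch.randint(0, 5000, (1, 10))     # previously decoded tokens
dec_feats, logits = model(feats, tokens)
print(dec_feats.shape, logits.shape)         # [1, 10, 256], [1, 10, 5000]
```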
2. The method of claim 1, wherein the encoding of the speech data, the decoding of the encoding features, and the processing of the decoding features are performed by a speech recognition model;
the speech recognition model is obtained by jointly training a speech recognition task and a grammar classification task, the speech recognition task being implemented by the speech recognition model.
3. The method of claim 2, wherein the process of jointly training the speech recognition task and the grammar classification task comprises:
encoding a speech sample through the speech recognition model to obtain encoding features of the speech sample, decoding the encoding features of the speech sample to obtain decoding features of the speech sample, and processing the decoding features of the speech sample to obtain a speech recognition result of the speech sample;
processing at least the decoding features of the speech sample through a grammar classification network to obtain a grammar classification result of the speech sample; and
updating parameters of the speech recognition model and parameters of the grammar classification network with the goal that the speech recognition result of the speech sample approaches a speech recognition label of the speech sample and the grammar classification result approaches a grammar label of the speech sample.
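One way to realize claim 3's joint objective is a weighted sum of a recognition loss and a grammar-classification loss computed from the same decoding features. In the sketch below, cross-entropy for both terms, the 0.3 weight, and the single linear grammar head are assumptions; the claim fixes neither the loss functions nor the weighting.

```python
# A hedged sketch of the joint training objective in claim 3: one backward
# pass produces gradients for both the recognizer's outputs and the grammar
# classification network. Loss choices and the alpha weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, num_grammar_classes, d_model = 5000, 16, 256
grammar_head = nn.Linear(d_model, num_grammar_classes)  # stand-in grammar classification network

def joint_loss(asr_logits, dec_feats, asr_labels, grammar_labels, alpha=0.3):
    # Recognition term: per-token cross-entropy against the transcript labels.
    loss_asr = F.cross_entropy(asr_logits.transpose(1, 2), asr_labels)
    # Grammar term: classify each decoding feature, score against grammar labels.
    grammar_logits = grammar_head(dec_feats)
    loss_gra = F.cross_entropy(grammar_logits.transpose(1, 2), grammar_labels)
    return loss_asr + alpha * loss_gra

asr_logits = torch.randn(2, 10, vocab_size, requires_grad=True)
dec_feats = torch.randn(2, 10, d_model)
asr_labels = torch.randint(0, vocab_size, (2, 10))
grammar_labels = torch.randint(0, num_grammar_classes, (2, 10))
joint_loss(asr_logits, dec_feats, asr_labels, grammar_labels).backward()
```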
4. The method according to claim 3, wherein the processing at least the decoding features of the speech sample through the grammar classification network comprises:
mapping the speech recognition result of the speech sample into an embedded feature through an embedding model;
fusing the embedded feature with the decoding features of the speech sample to obtain a fusion feature; and
inputting the fusion feature into a pre-trained grammar classification network to obtain the grammar classification result of the speech sample output by the pre-trained grammar classification network.
5. The method of claim 4, wherein the updating of the parameters of the speech recognition model and the parameters of the grammar classification network with the goal that the speech recognition result of the speech sample approaches the speech recognition label of the speech sample and the grammar classification result approaches the grammar label of the speech sample comprises:
updating the parameters of the speech recognition model, the parameters of the pre-trained grammar classification network, and parameters of the embedding model with the goal that the speech recognition result of the speech sample approaches the speech recognition label of the speech sample and the grammar classification result approaches the grammar label of the speech sample.
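Claims 4-5 add an embedding of the recognition result that is fused with the decoding features before grammar classification. The sketch below assumes concatenation followed by a linear projection as the fusion step and a small perceptron as the pre-trained grammar classification network; the claims leave both open.

```python
# A sketch of claims 4-5: embed the recognized tokens, fuse the embedding
# with the decoding features, and classify the fused feature. Concatenation
# as the fusion operation and all sizes are assumptions.
import torch
import torch.nn as nn

d_model, vocab_size, num_grammar_classes = 256, 5000, 16

embed_model = nn.Embedding(vocab_size, d_model)   # embedding model of claim 4
fusion_proj = nn.Linear(2 * d_model, d_model)     # assumed fusion: concat + project
grammar_net = nn.Sequential(                      # stand-in pre-trained classifier
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, num_grammar_classes))

def classify_grammar(recognized_tokens, dec_feats):
    emb = embed_model(recognized_tokens)          # (batch, tokens, d_model)
    fused = torch.cat([emb, dec_feats], dim=-1)   # fusion feature
    return grammar_net(fusion_proj(fused))        # grammar classification logits

tokens = torch.randint(0, vocab_size, (1, 10))
dec_feats = torch.randn(1, 10, d_model)
print(classify_grammar(tokens, dec_feats).shape)  # [1, 10, 16]
```

Under claim 5, an optimizer would then cover the recognition model, grammar_net, and embed_model together, so the embedding is learned jointly rather than frozen.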
6. The method according to claim 3, wherein the processing at least the decoding features of the speech sample through the grammar classification network comprises:
performing linear processing on the decoding features of the speech sample through a linear processing module of the grammar classification network to obtain a linear processing result of the decoding features of the speech sample; and
performing classification processing on the linear processing result through a classification module of the grammar classification network to obtain the grammar classification result of the speech sample output by the grammar classification network.
7. The method of claim 6, wherein the performing linear processing on the decoding features of the speech sample through the linear processing module of the grammar classification network and performing classification processing on the linear processing result through the classification module of the grammar classification network comprises:
performing first linear processing on the decoding features of the speech sample through a first linear processing module to obtain a first linear processing result, and performing first classification processing on the first linear processing result through a first classification module to obtain a part-of-speech classification result of the speech sample;
and/or,
performing second linear processing on the decoding features of the speech sample through a second linear processing module to obtain a second linear processing result, and performing second classification processing on the second linear processing result through a second classification module to obtain a dependency syntax classification result of the speech sample.
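Claims 6-7 decompose the grammar classification network into linear-processing and classification modules, optionally duplicated into a part-of-speech branch and a dependency-syntax branch over the same decoding features. In this sketch the ReLU between the two modules and the tag-set sizes are assumptions, not claim features.

```python
# A sketch of claims 6-7: two parallel linear + classification branches,
# one producing part-of-speech logits and one producing dependency-syntax
# logits. The activation and the tag-set sizes are assumptions.
import torch
import torch.nn as nn

class GrammarHeads(nn.Module):
    def __init__(self, d_model=256, num_pos_tags=32, num_dep_labels=40):
        super().__init__()
        self.pos_linear = nn.Linear(d_model, d_model)      # first linear processing module
        self.pos_cls = nn.Linear(d_model, num_pos_tags)    # first classification module
        self.dep_linear = nn.Linear(d_model, d_model)      # second linear processing module
        self.dep_cls = nn.Linear(d_model, num_dep_labels)  # second classification module

    def forward(self, dec_feats):
        pos_logits = self.pos_cls(torch.relu(self.pos_linear(dec_feats)))
        dep_logits = self.dep_cls(torch.relu(self.dep_linear(dec_feats)))
        return pos_logits, dep_logits

pos, dep = GrammarHeads()(torch.randn(1, 10, 256))
print(pos.shape, dep.shape)   # [1, 10, 32], [1, 10, 40]
```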
8. A method for training a speech recognition model, comprising:
encoding a speech sample through the speech recognition model to obtain encoding features of the speech sample, decoding the encoding features of the speech sample to obtain decoding features of the speech sample, and processing the decoding features of the speech sample to obtain a speech recognition result of the speech sample;
processing at least the decoding features of the speech sample through a grammar classification network to obtain a grammar classification result of the speech sample; and
updating parameters of the speech recognition model and parameters of the grammar classification network with the goal that the speech recognition result of the speech sample approaches a speech recognition label of the speech sample and the grammar classification result approaches a grammar label of the speech sample.
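Claim 8 restates the training procedure of claim 3 as an independent claim. Reusing the hypothetical SketchASR, grammar_head, and joint_loss definitions from the sketches above, one optimizer step might look as follows; a single optimizer over the union of both parameter sets is an assumed design choice, not something the claim mandates.

```python
# One hypothetical optimizer step for claim 8, reusing the SketchASR,
# grammar_head, and joint_loss sketches above (so not self-contained).
# A single Adam optimizer over both parameter sets is an assumption.
import itertools
import torch

model = SketchASR()
optimizer = torch.optim.Adam(
    itertools.chain(model.parameters(), grammar_head.parameters()), lr=1e-4)

feats = torch.randn(2, 120, 80)
tokens = torch.randint(0, 5000, (2, 10))
asr_labels = torch.randint(0, 5000, (2, 10))
grammar_labels = torch.randint(0, 16, (2, 10))

dec_feats, logits = model(feats, tokens)
loss = joint_loss(logits, dec_feats, asr_labels, grammar_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()   # updates both the recognition model and the grammar network
```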
9. A speech recognition apparatus, comprising:
an encoding module, configured to encode speech data to obtain encoding features of the speech data;
a decoding module, configured to decode the encoding features to obtain decoding features, wherein the decoding features are used for determining a speech recognition result and a grammar classification result of the speech data; and
a processing module, configured to process the decoding features to obtain the speech recognition result.
10. A speech recognition model training device, comprising:
a recognition module, configured to encode a speech sample through a speech recognition model to obtain encoding features of the speech sample, decode the encoding features of the speech sample to obtain decoding features of the speech sample, and process the decoding features of the speech sample to obtain a speech recognition result of the speech sample;
a classification module, configured to process at least the decoding features of the speech sample through a grammar classification network to obtain a grammar classification result of the speech sample; and
an updating module, configured to update parameters of the speech recognition model and parameters of the grammar classification network with the goal that the speech recognition result of the speech sample approaches a speech recognition label of the speech sample and the grammar classification result approaches a grammar label of the speech sample.
11. A speech processing device, comprising a memory and a processor, wherein:
the memory is configured to store a program; and
the processor is configured to execute the program to implement the steps of the speech recognition method according to any one of claims 1-7 and/or the speech recognition model training method according to claim 8.
12. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1-7 and/or the speech recognition model training method according to claim 8.
CN202311871442.9A 2023-12-29 2023-12-29 Speech recognition method, model training method, device, equipment and storage medium Pending CN117746864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311871442.9A CN117746864A (en) 2023-12-29 2023-12-29 Speech recognition method, model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311871442.9A CN117746864A (en) 2023-12-29 2023-12-29 Speech recognition method, model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117746864A 2024-03-22

Family

ID=90256442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311871442.9A Pending CN117746864A (en) 2023-12-29 2023-12-29 Speech recognition method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117746864A (en)

Similar Documents

Publication Title
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN110489555B (en) Language model pre-training method combined with similar word information
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN109979432B (en) Dialect translation method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110163181B (en) Sign language identification method and device
CN112735373A (en) Speech synthesis method, apparatus, device and storage medium
CN115309877B (en) Dialogue generation method, dialogue model training method and device
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN114254660A (en) Multi-modal translation method and device, electronic equipment and computer-readable storage medium
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN112634918A (en) Acoustic posterior probability based arbitrary speaker voice conversion system and method
CN111382257A (en) Method and system for generating dialog context
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN114239607A (en) Conversation reply method and device
CN113793599A (en) Training method of voice recognition model and voice recognition method and device
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN112668346A (en) Translation method, device, equipment and storage medium
CN112989794A (en) Model training method and device, intelligent robot and storage medium
CN115132182B (en) Data identification method, device, equipment and readable storage medium
CN117746864A (en) Speech recognition method, model training method, device, equipment and storage medium
CN112818688B (en) Text processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination