CN109461438B - Voice recognition method, device, equipment and storage medium


Info

Publication number
CN109461438B
Authority
CN
China
Prior art keywords
voice
recognized
speech
content
decoding
Prior art date
Legal status
Active
Application number
CN201811556515.4A
Other languages
Chinese (zh)
Other versions
CN109461438A (en)
Inventor
方昕
刘海波
汪睿
方磊
Current Assignee
Hefei Ustc Iflytek Co ltd
Original Assignee
Hefei Ustc Iflytek Co ltd
Priority date
Filing date
Publication date
Application filed by Hefei Ustc Iflytek Co ltd
Priority to CN201811556515.4A
Publication of CN109461438A
Application granted
Publication of CN109461438B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates

Abstract

The application provides a voice recognition method, apparatus, device and storage medium. The method comprises: extracting voice features of voice data to be recognized; determining attribute information of the voice content of the voice data to be recognized according to the voice features; and determining the voice content of the voice data to be recognized according to the voice features and the attribute information of the voice content. Because the content obtained by this recognition process comprises both the attribute information of the voice content and the specific voice content itself, recognition confusion caused by failing to distinguish the attributes of the voice content can be effectively avoided, and voice recognition accuracy is improved.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
The end-to-end speech recognition model recognizes individual characters and words in input speech data and combines the recognized characters and words to obtain whole words or sentences. For example, a Chinese speech recognition model often uses Chinese characters or words as the modeling unit: by learning the correspondence between input speech data and output Chinese characters or words, it recognizes the characters or words contained in the input speech data and then combines them to obtain the recognized text.
In some languages, individual characters or words carry attribute characteristics that govern their usage when they form whole words or sentences, and a single character or word may even have several attributes. For example, in many agglutinative languages, some sub-words that serve as sentence elements can act as either stems or affixes; when such a sub-word's attribute (stem or affix) differs, its case relationship with adjacent sub-words differs, and the whole word or sentence it forms differs as well. The end-to-end speech recognition model simply establishes a correspondence between speech data and individual characters or words and directly splices the recognized characters or words into the recognition result. When training is insufficient, characters or words are often confused and the recognition results are spliced incorrectly, so speech recognition is inaccurate.
Disclosure of Invention
To address these problems in the existing voice recognition technology, the application provides a voice recognition method, apparatus, device and storage medium for resolving recognition confusion in voice recognition and thereby improving the accuracy of voice recognition. The technical solution is as follows:
a speech recognition method comprising:
extracting voice features of voice data to be recognized;
determining attribute information of voice content of the voice data to be recognized according to the voice characteristics;
and determining the voice content of the voice data to be recognized according to the voice characteristics and the attribute information of the voice content of the voice data to be recognized.
Optionally, the attribute information includes language component information, where the language component information represents component attributes of the voice content in the whole words and/or sentences formed by the voice content;
correspondingly, the determining the attribute information of the voice content of the voice data to be recognized according to the voice feature includes:
inputting the voice features into a pre-trained first decoding model, and decoding to obtain language component information of voice content of the voice data to be recognized; the first decoding model is obtained by training at least according to the decoding training characteristic sample marked with the language component information of the content to be recognized.
Optionally, the attribute information further includes part of speech type information;
correspondingly, the determining the attribute information of the voice content of the voice data to be recognized according to the voice feature further includes:
inputting the voice characteristics and the language component information of the voice content of the voice data to be recognized into a pre-trained second decoding model, and decoding to obtain the part-of-speech type information of the voice content of the voice data to be recognized; and the second decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
Optionally, the attribute information includes language component information and part of speech type information; the language component information represents the component attributes of the voice content in the whole words and/or sentences formed by the voice content;
correspondingly, the determining the attribute information of the voice content of the voice data to be recognized according to the voice feature includes:
inputting the voice characteristics into a pre-trained third decoding model, and decoding to obtain language component information and part of speech type information of voice contents of the voice data to be recognized; and the third decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
Wherein, the determining the voice content of the voice data to be recognized according to the voice feature and the attribute information of the voice content of the voice data to be recognized includes:
inputting the voice features and attribute information of the voice content of the voice data to be recognized into a pre-trained fourth decoding model, and decoding to obtain the voice content of the voice data to be recognized; and the fourth decoding model is obtained by training at least according to the decoding training characteristic sample marked with the content to be recognized and the attribute information of the content to be recognized.
Wherein the decoding training feature samples comprise speech feature samples and text feature samples.
The speech feature samples are obtained by performing speech feature extraction on speech training samples with a preset speech encoder; the text feature samples are obtained by performing text feature extraction on text training samples with a preset text encoder;
the speech encoder is obtained by feature extraction training on at least speech data samples, the text encoder is obtained by feature extraction training on at least text data samples, and the speech encoder and the text encoder undergo joint training so that the speech features output by the speech encoder have the same feature representation distribution as the text features output by the text encoder.
Wherein the joint training of the speech encoder and the text encoder comprises:
recognizing, with a pre-trained discriminator model, the text features output by the text encoder and the speech features output by the speech encoder respectively, wherein the discriminator model is obtained by training on at least text feature samples and speech feature samples;
when the discriminator model can distinguish the text features output by the text encoder from the speech features output by the speech encoder, correcting the parameters of the text encoder and the speech encoder according to the negative cross-entropy gradient of the discriminator model;
and repeating the above process until the discriminator model can no longer distinguish the text features output by the text encoder from the speech features output by the speech encoder.
Wherein recognizing the text features output by the text encoder and the speech features output by the speech encoder respectively with the pre-trained discriminator model comprises:
recognizing, with the pre-trained discriminator model, the average pooling vector of the text feature vectors output by the text encoder and the average pooling vector of the speech feature vectors output by the speech encoder respectively.
Wherein extracting the voice features of the voice data to be recognized comprises:
inputting the voice data to be recognized into the speech encoder that has undergone the joint training with the text encoder, and extracting the voice features of the voice data to be recognized.
A speech recognition apparatus comprising:
the characteristic extraction unit is used for extracting the voice characteristics of the voice data to be recognized;
the attribute determining unit is used for determining the attribute information of the voice content of the voice data to be recognized according to the voice characteristics;
and the content identification unit is used for determining the voice content of the voice data to be identified according to the voice characteristics and the attribute information of the voice content of the voice data to be identified.
Optionally, the attribute information includes language component information, where the language component information represents component attributes of the voice content in the whole words and/or sentences formed by the voice content;
correspondingly, when the attribute determining unit determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is specifically configured to:
inputting the voice features into a pre-trained first decoding model, and decoding to obtain language component information of voice content of the voice data to be recognized; the first decoding model is obtained by training at least according to the decoding training characteristic sample marked with the language component information of the content to be recognized.
Optionally, the attribute information further includes part of speech type information;
correspondingly, when the attribute determining unit determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is further configured to:
inputting the voice characteristics and the language component information of the voice content of the voice data to be recognized into a pre-trained second decoding model, and decoding to obtain the part-of-speech type information of the voice content of the voice data to be recognized; and the second decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
Optionally, the attribute information includes language component information and part of speech type information; the language component information represents the component attribute of the voice content in the whole words and/or sentences formed by the voice content;
correspondingly, when the attribute determining unit determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is specifically configured to:
inputting the voice characteristics into a pre-trained third decoding model, and decoding to obtain language component information and part of speech type information of voice contents of the voice data to be recognized; and the third decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
When determining the voice content of the voice data to be recognized according to the voice feature and the attribute information of the voice content of the voice data to be recognized, the content recognition unit is specifically configured to:
inputting the voice features and attribute information of the voice content of the voice data to be recognized into a pre-trained fourth decoding model, and decoding to obtain the voice content of the voice data to be recognized; and the fourth decoding model is obtained by training at least according to the decoding training characteristic sample marked with the content to be recognized and the attribute information of the content to be recognized.
Wherein the decoding training feature samples comprise speech feature samples and text feature samples.
The speech feature samples are obtained by performing speech feature extraction on speech training samples with a preset speech encoder; the text feature samples are obtained by performing text feature extraction on text training samples with a preset text encoder;
the speech encoder is obtained by feature extraction training on at least speech data samples, the text encoder is obtained by feature extraction training on at least text data samples, and the speech encoder and the text encoder undergo joint training so that the speech features output by the speech encoder have the same feature representation distribution as the text features output by the text encoder.
Optionally, the joint training of the speech encoder and the text encoder includes:
recognizing, with a pre-trained discriminator model, the text features output by the text encoder and the speech features output by the speech encoder respectively, wherein the discriminator model is obtained by training on at least text feature samples and speech feature samples;
when the discriminator model can distinguish the text features output by the text encoder from the speech features output by the speech encoder, correcting the parameters of the text encoder and the speech encoder according to the negative cross-entropy gradient of the discriminator model;
and repeating the above process until the discriminator model can no longer distinguish the text features output by the text encoder from the speech features output by the speech encoder.
Optionally, recognizing the text features output by the text encoder and the speech features output by the speech encoder respectively with a pre-trained discriminator model includes:
recognizing, with the pre-trained discriminator model, the average pooling vector of the text feature vectors output by the text encoder and the average pooling vector of the speech feature vectors output by the speech encoder respectively.
Wherein extracting the voice features of the voice data to be recognized comprises:
inputting the voice data to be recognized into the speech encoder that has undergone the joint training with the text encoder, and extracting the voice features of the voice data to be recognized.
A speech recognition device comprising:
a memory and a processor;
wherein the memory is connected with the processor and used for storing programs;
the processor is used for realizing the following functions by running the program stored in the memory:
extracting voice features of voice data to be recognized; determining attribute information of voice content of the voice data to be recognized according to the voice characteristics; and determining the voice content of the voice data to be recognized according to the voice characteristics and the attribute information of the voice content of the voice data to be recognized.
A storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-mentioned speech recognition method.
When the voice recognition method provided by the application is used to recognize the voice features of the voice data to be recognized, the attribute information of the voice content of the voice data to be recognized is recognized first, and the voice content itself is then recognized with that attribute information as a reference. Because the content obtained through this recognition process includes both the attribute information of the voice content and the specific content of the voice content, recognition confusion caused by failing to distinguish the attributes of the voice content can be effectively avoided, and voice recognition accuracy is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another speech recognition system provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart of a speech recognition method provided in the embodiments of the present application;
FIG. 5 is a schematic structural diagram of another speech recognition system provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a training system of a speech recognition system provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of another training system of a speech recognition system provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
The technical solution of the embodiments of the present application is applicable to speech recognition scenarios.
Generally, speech recognition proceeds by first recognizing the speech data to be recognized frame by frame or segment by segment ("breaking the whole into parts") to obtain individual character or word recognition results, and then merging those results ("assembling the parts into a whole") into whole words or sentences, which constitute the speech recognition result.
For example, suppose the speech data to be recognized is a recording of a person saying the word "ABCDEF". When this speech data is recognized, it is divided into smaller units, namely speech frames; generally several speech frames correspond to one phoneme. Suppose the division yields a number of consecutive speech frames and that recognizing them produces the results "AB", "CD" and "EF" in turn. The three results are then concatenated according to the temporal order of their corresponding speech segments, giving the recognition result, i.e., the text "ABCDEF".
The speech recognition process described above is typically implemented through an end-to-end speech recognition model. The end-to-end speech recognition model takes single characters or words as a modeling unit, and single characters or words can be recognized from speech data by learning the corresponding relation between the speech data and the single characters or words, so that a speech recognition result is obtained.
In some languages, the characters or words are divided by attribute in order to distinguish their specific usage, such as word-formation rules. For example, in agglutinative languages, the sub-words that make up a whole word are divided into stems and affixes. A stem can stand on its own as the core of a whole word, whereas an affix serves as auxiliary content of a stem: when an affix appears before or after a stem, it must be combined with that stem into a whole word.
For example, suppose the word "ABCDEF" is an agglutinative word whose sub-word recognition results are "AB", "CD" and "EF", where "AB" is a stem and "CD" and "EF" are both affixes. To make this easy to distinguish in the representation, a hyphen is prepended to affixes in the sub-word recognition results, i.e., "CD" and "EF" are represented as "-CD" and "-EF". When the recognition results "AB", "-CD" and "-EF" are combined into a whole word, the affixes "-CD" and "-EF" merge forward onto the stem "AB" to form the whole word "ABCDEF".
However, in agglutinative languages such as this, many sub-words can serve as either a stem or an affix. Whether the stem or affix attribute of each sub-word is identified correctly therefore directly affects the correctness of the finally recognized whole words.
For example, suppose a person reads the two words "ABCDEF" and "CDGH" in sequence, producing a piece of speech data to be recognized, and suppose that recognizing this speech data yields the sub-words "AB", "CD", "EF", "CD" and "GH" in turn. Of the two occurrences of "CD", the first is an affix and the second is a stem; "AB" is a stem, while "EF" and "GH" are affixes. That is, the recognized sub-words should be "AB", "-CD", "-EF", "CD" and "-GH" in sequence. Combining these sub-words yields the speech recognition result "ABCDEF CDGH", a text containing two words, which is the correct result.
If the attributes of the sub-word "CD" cannot be distinguished correctly, then recognizing the first "CD" as a stem yields the recognition result "AB CDEF CDGH", a text containing three words, while recognizing the second "CD" as an affix yields "ABCDEFCDGH", a text containing only one word. Both are obviously wrong recognition results.
When an end-to-end speech recognition model is applied to a language that, like the agglutinative language above, attaches attribute characteristics to its text elements, the only way to make the model recognize whole words accurately is to have it learn, from a large amount of training data, both the recognition of individual characters or words and the process of combining them into whole words, so that it can splice the characters or words into whole words reasonably on the basis of recognizing them accurately. However, when the end-to-end model is trained insufficiently, individual characters or words are easily confused and splicing errors readily occur.
In view of these shortcomings of the existing voice recognition technology, the embodiments of the present application provide a voice recognition method that adds attribute information recognition to the voice recognition process, thereby improving recognition accuracy.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a speech recognition method provided in an embodiment of the present application includes:
and S101, extracting the voice characteristics of the voice data to be recognized.
The voice data to be recognized is data recording a voice signal produced in a certain language; it can be collected with a professional voice collection device such as a recorder. A voice signal produced in a language includes, but is not limited to, a voice signal produced by a person speaking and a voice signal produced by a machine imitating a human voice.
The technical scheme of the embodiment of the application aims to identify the voice content of the voice data to be identified. Based on the variability of the recognition object, the voice data to be recognized may be voice data of any duration, for example, the whole voice data of several seconds and several minutes, or voice data of one data frame length, etc.
The voice feature may specifically be a feature representing the acoustic information of the voice data to be recognized, for example a Filter Bank feature, an MFCC feature or a PLP feature of the voice data to be recognized. As the feature representation of the voice data to be recognized, the voice feature serves as the basis for recognizing it. As an alternative implementation, the voice features may be represented in the form of feature vectors.
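As an illustration of how such acoustic features might be extracted, the following is a minimal sketch using the librosa library; the file name, sampling rate, frame settings and feature dimensions are assumptions chosen for the example rather than values required by the application.

```python
# A minimal sketch of step S101, assuming librosa is available and that 16 kHz
# audio with 40 Mel bands and 13 MFCCs is an acceptable configuration.
import librosa
import numpy as np

def extract_speech_features(wav_path: str) -> np.ndarray:
    """Return a (num_frames, feature_dim) matrix of acoustic features."""
    # Load the speech data to be recognized (hypothetical file path).
    waveform, sample_rate = librosa.load(wav_path, sr=16000)

    # Log-Mel filterbank ("Filter Bank") features, one column per frame.
    mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                         n_fft=400, hop_length=160, n_mels=40)
    fbank = np.log(mel + 1e-10)

    # MFCC features computed from the same signal, with matching framing.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13,
                                n_fft=400, hop_length=160)

    # Stack both representations frame by frame into a single feature vector.
    return np.concatenate([fbank, mfcc], axis=0).T

features = extract_speech_features("to_be_recognized.wav")  # e.g. shape (T, 53)
```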
S102, determining the attribute information of the voice content of the voice data to be recognized according to the voice features.
The voice content of the voice data to be recognized refers to the specific content of the language recorded in the voice data to be recognized. For example, if the voice data to be recognized is data obtained by recording a voice signal of a person speaking, the voice content of the voice data to be recognized is the specific content of the person speaking.
The attribute information of the voice content is information indicating the attribute characteristics of the language elements contained in the voice content. A language element is a basic unit that makes up a language, for example a character or a word.
For example, if the voice data to be recognized is voice signal data of a person reading the agglutinative word "ABCDEF", where "ABCDEF" consists of the sub-words "AB", "CD" and "EF", then the attribute information of the voice content of the voice data to be recognized is the attribute information of each of the sub-words "AB", "CD" and "EF", for example the type of the sub-word (stem or affix), its part of speech (noun, verb, etc.), and so on.
Since the specific voice content of the voice data to be recognized has not yet been recognized when the voice recognition process reaches step S102, the attribute information of the voice content is determined from the voice features of the voice data to be recognized; specifically, the attribute information corresponding to the phonemes or phoneme sets contained in the voice data is determined. Because a phoneme or phoneme set in the voice data to be recognized corresponds to a character or word in its voice content, determining the attribute information corresponding to the phonemes or phoneme sets amounts to determining the attribute information of the voice content of the voice data to be recognized.
In an exemplary implementation, the attribute information corresponding to the phonemes or phoneme sets contained in the voice data to be recognized may be determined according to preset attribute determination rules or attribute restriction relationships. For example, for the agglutinative words above, suppose it is preset that the sub-word "EF" can only be an affix, or that when the sub-word "CD" is adjacent to the sub-word "AB", "AB" is a stem and "CD" is an affix of "AB". When recognizing voice data in which the agglutinative word "ABCDEF" is read aloud, voice recognition and voice content attribute recognition are performed on the voice features according to these preset rules and restriction relationships. When a phoneme or phoneme set corresponding to the sub-word "AB" is identified, the corresponding voice content can be determined to be a stem; when a phoneme or phoneme set corresponding to the sub-words "CD" or "EF" is identified, the corresponding voice content can be determined to be an affix.
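A minimal sketch of such preset attribute determination rules is given below; the rule table, function name and default behavior are illustrative assumptions that encode only the two example constraints above.

```python
# A toy rule table for the example above: "EF" may only be an affix, and when
# "CD" follows "AB", "AB" is treated as a stem and "CD" as its affix.
AFFIX_ONLY = {"EF"}

def determine_attributes(subwords):
    """Assign 'stem' or 'affix' to each recognized sub-word by simple rules."""
    attributes = []
    for i, sw in enumerate(subwords):
        if sw in AFFIX_ONLY:
            attributes.append("affix")
        elif sw == "CD" and i > 0 and subwords[i - 1] == "AB":
            attributes.append("affix")          # "CD" adjacent to "AB" is an affix
        else:
            attributes.append("stem")           # default when no rule applies
    return attributes

print(determine_attributes(["AB", "CD", "EF"]))  # ['stem', 'affix', 'affix']
```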
S103, determining the voice content of the voice data to be recognized according to the voice characteristics and the attribute information of the voice content of the voice data to be recognized.
After the attribute information of the voice content of the voice data to be recognized is determined, that is, after the attribute information of each language element contained in the voice data is determined, the embodiment of the application further recognizes, on this basis and using the voice features, the specific content of each language element contained in the voice data to be recognized. Once the attribute information of the voice content and the specific content itself have each been determined, the voice recognition result can be determined from the specific content and attribute information of every language element contained in the voice data to be recognized.
As a preferred implementation, an embodiment of the present application provides the following specific process for determining the voice content of the voice data to be recognized according to the voice features and the attribute information of the voice content:
the voice features are decoded into voice content under the constraint of the attribute information of the voice content, i.e., the voice features are decoded within a search space defined by the attribute information of the voice content, so as to obtain the voice content corresponding to that attribute information.
For example, suppose voice recognition is performed on the voice signal data of the agglutinative word "ABCDEF" above. Once the first sub-word of the voice data is determined to be a stem, then when its specific content is further recognized, the voice features are decoded only within the stem space, and a sub-word that serves as a stem is decoded as the recognition result of the first sub-word.
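The constrained decoding described above can be sketched as masking, at each decoding step, every sub-word whose attribute does not match the one predicted for that position; the vocabulary, attribute table and scores below are illustrative assumptions.

```python
import numpy as np

# Hypothetical sub-word vocabulary annotated with its stem/affix attribute.
VOCAB = ["AB", "CD", "EF", "GH"]
ATTRIBUTE = {"AB": "stem", "CD": "stem|affix", "EF": "affix", "GH": "affix"}

def decode_step(posteriors: np.ndarray, predicted_attribute: str) -> str:
    """Pick the best sub-word among those compatible with the attribute.

    posteriors: model scores over VOCAB for the current decoding step.
    predicted_attribute: 'stem' or 'affix', produced by the attribute decoding.
    """
    mask = np.array([predicted_attribute in ATTRIBUTE[w] for w in VOCAB])
    constrained = np.where(mask, posteriors, -np.inf)   # search only inside
    return VOCAB[int(np.argmax(constrained))]           # the allowed space

# If the attribute decoding says the first sub-word is a stem, "EF" and "GH"
# are excluded even when their raw scores are competitive.
scores = np.array([0.40, 0.15, 0.42, 0.03])
print(decode_step(scores, "stem"))   # -> "AB"
```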
Further, when the voice data to be recognized contains multiple language elements, i.e., multiple characters or words, determining the voice recognition result may specifically include merging the specific contents of those characters or words according to their recognized attribute information to obtain a whole word or sentence, thereby obtaining the voice recognition result.
For example, taking recognition of the voice signal data of the agglutinative word "ABCDEF" above, if step S102 determines that the attributes of the three sub-words contained in the voice data are, in turn, stem, affix and affix, then when step S103 recognizes the three sub-words as "AB", "CD" and "EF", the voice contents to be merged are "AB", "CD" and "EF" with the latter two marked as affixes, and merging them according to the predefined attribute-based word-formation rules yields the recognition result, i.e., "ABCDEF".
When the voice data to be recognized contains only one character or word, there is no merging with other characters or words, and the recognized character or word is used directly as the voice recognition result.
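A minimal sketch of the attribute-driven merging described above, under the assumed convention that an affix attaches to the nearest preceding stem (the function and convention are illustrative, not prescribed by the application):

```python
def merge_subwords(subwords, attributes):
    """Combine recognized sub-words into whole words using stem/affix attributes.

    subwords:   e.g. ["AB", "CD", "EF", "CD", "GH"]
    attributes: e.g. ["stem", "affix", "affix", "stem", "affix"]
    """
    words = []
    for sw, attr in zip(subwords, attributes):
        if attr == "stem" or not words:
            words.append(sw)            # a stem starts a new whole word
        else:
            words[-1] += sw             # an affix merges into the preceding stem
    return " ".join(words)

print(merge_subwords(["AB", "CD", "EF", "CD", "GH"],
                     ["stem", "affix", "affix", "stem", "affix"]))
# -> "ABCDEF CDGH"
```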
It can be seen from the above description that the voice recognition method provided in the embodiment of the present application, when recognizing the voice features of the voice data to be recognized, first recognizes the attribute information of the voice content and then recognizes the voice content itself with that attribute information as a reference. Because the content obtained through this process includes both the attribute information of the voice content and the specific content itself, recognition confusion caused by failing to distinguish the attributes of the voice content can be effectively avoided, and the accuracy of voice recognition can be improved.
As a preferred implementation manner, the embodiment of the present application implements a specific processing procedure of the proposed speech recognition method by means of a pre-trained model, including extraction of speech features, recognition of attribute information, speech content recognition, and the like.
Further, according to the division of the speech recognition process, the extraction of the speech features may be implemented by a speech coding process, and the processes of the attribute information recognition and the speech content recognition may be implemented by a feature decoding process, so that the coding models may be respectively configured to implement the speech feature extraction, and the decoding models may be configured to implement the attribute information recognition and the speech content recognition processes.
Based on the above setting, the speech recognition method provided in the embodiment of the present application may be implemented based on the speech recognition system shown in fig. 2. The voice recognition system is composed of a coding model 1 and a decoding model 2, wherein the coding model 1 is used for carrying out feature extraction processing on voice data to be recognized to obtain voice features; the coding model 1 extracts the obtained voice features as an input of the decoding model 2, and the decoding model 2 performs feature decoding processing, specifically, identifying attribute information of voice content of the voice data to be recognized, and identifying specific voice content of the voice data to be recognized.
Further, the speech recognition model shown in fig. 2 may be further specifically divided into different sub models or functional modules according to different functions. Referring to fig. 3, the coding model 1 may include a plurality of coding layers, where each coding layer is configured to perform a layer of coding process on the speech data X to be recognized, and an attention module, where the attention module is configured to combine a decoding state of the decoding model 2 at a previous time with a feature vector output by a last coding layer of the coding model to form an attention weight, and then perform weighted summation on the attention weight and the feature vector output by the last coding layer of the coding model according to a vector dimension to obtain a speech feature vector of the speech data at the current time
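A minimal PyTorch sketch of such an attention module, assuming additive scoring between the previous decoding state and the outputs of the last encoding layer; the layer sizes and the scoring function are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Weight the last encoding layer's outputs by the previous decoder state."""

    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int = 128):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, attn_dim)
        self.dec_proj = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, enc_outputs: torch.Tensor, prev_dec_state: torch.Tensor):
        # enc_outputs: (batch, T, enc_dim) from the last encoding layer
        # prev_dec_state: (batch, dec_dim) decoding state at the previous time
        energy = torch.tanh(self.enc_proj(enc_outputs)
                            + self.dec_proj(prev_dec_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, T)
        # Weighted sum over time gives the speech feature vector for this step.
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights

attn = AttentionModule(enc_dim=256, dec_dim=256)
ctx, w = attn(torch.randn(2, 50, 256), torch.randn(2, 256))  # ctx: (2, 256)
```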
The decoding model 2 described above can be further divided into an attribute decoding model 21 and a content decoding model 22.
The attribute decoding model 21 is configured to execute step S102 of the voice recognition method shown in fig. 1, i.e., to determine the attribute information of the voice content of the voice data to be recognized according to its voice features; in other words, it performs attribute decoding on the voice features of the voice data to be recognized and determines the attribute information of its voice content.
The content decoding model 22 is configured to execute step S103 of the voice recognition method shown in fig. 1, i.e., to determine the voice content of the voice data to be recognized according to its voice features and the attribute information of its voice content; in other words, it decodes the voice features of the voice data to be recognized under the attribute information determined by the attribute decoding model 21 and determines the voice content of the voice data to be recognized.
As an alternative implementation, referring to fig. 3, when the encoding model 1 encodes the voice data X to be recognized, encoding is performed in units of data frames, i.e., each of the data frames X1, X2, ..., Xt shown in the figure is encoded separately. Accordingly, when the decoding model 2 decodes the voice features of the voice data to be recognized, it may also operate in units of frames.
Optionally, another embodiment of the present application discloses the following specific implementation of processing step S103 of the voice recognition method shown in fig. 1, i.e., determining the voice content of the voice data to be recognized according to the voice features and the attribute information of the voice content:
the voice features and the attribute information of the voice content of the voice data to be recognized are input into a pre-trained fourth decoding model, and the voice content of the voice data to be recognized is obtained by decoding.
It can be understood that the fourth decoding model, namely the decoding model trained to implement the voice content decoding function of the content decoding model 22, can be used as the content decoding model 22 shown in fig. 3 to perform the voice content decoding process, that is, to execute processing step S103 of the voice recognition method shown in fig. 1.
The fourth decoding model is obtained by training at least on decoding training feature samples labeled with the content to be recognized and with the attribute information of the content to be recognized.
The decoding training feature sample refers to a feature sample used for performing decoding training on the fourth decoding model, and the feature sample is specifically a feature obtained by performing feature extraction on training sample data. Illustratively, the decoded training feature samples may be represented in the form of feature vectors.
The content to be recognized is the specific content in the data to be recognized corresponding to the decoding training feature sample. When the decoding training feature sample is a feature sample of voice data, the content to be recognized is the voice content of the voice data.
On the basis that the decoding training feature samples are labeled with the content to be recognized and its attribute information, the decoding training feature samples and the attribute information of the content to be recognized are input into the fourth decoding model, so that the fourth decoding model performs content decoding on the training feature samples under that attribute information and determines their specific content.
The specific content obtained by the decoding of the fourth decoding model is then compared with the labeled content to be recognized, and the parameters of the fourth decoding model are corrected backward according to the difference between the two.
The recognition and backward parameter correction are repeated until the recognition accuracy of the fourth decoding model reaches a preset requirement, at which point training stops.
The trained fourth decoding model can then perform content decoding on the voice features of the voice data to be recognized under the constraint of the attribute information of its voice content, and determine the voice content of the voice data to be recognized.
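The training procedure above (decode, compare with the labels, correct the parameters backward, repeat until the accuracy requirement is met) can be sketched as an ordinary supervised loop; the toy decoder, optimizer and stopping accuracy below are illustrative assumptions, and train_loader stands for any iterable of labeled training batches.

```python
import torch
import torch.nn as nn

# A deliberately tiny stand-in for the fourth decoding model: it consumes a
# speech feature vector plus an attribute id and predicts a sub-word id.
class ToyContentDecoder(nn.Module):
    def __init__(self, feat_dim=53, num_attributes=4, vocab_size=100):
        super().__init__()
        self.attr_emb = nn.Embedding(num_attributes, 16)
        self.out = nn.Linear(feat_dim + 16, vocab_size)

    def forward(self, features, attribute_ids):
        return self.out(torch.cat([features, self.attr_emb(attribute_ids)], dim=-1))

decoder = ToyContentDecoder()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def train_until_accurate(train_loader, target_accuracy=0.95, max_epochs=50):
    """Repeat decode / compare / backward-correct until accuracy is reached."""
    for _ in range(max_epochs):
        correct, total = 0, 0
        for features, attribute_ids, content_ids in train_loader:
            logits = decoder(features, attribute_ids)   # decode under attributes
            loss = criterion(logits, content_ids)       # compare with the labels
            optimizer.zero_grad()
            loss.backward()                             # backward parameter correction
            optimizer.step()
            correct += (logits.argmax(-1) == content_ids).sum().item()
            total += content_ids.numel()
        if correct / total >= target_accuracy:          # preset requirement met
            return
```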
It should be noted that the attribute information of the voice content of the voice data to be recognized may be set flexibly according to the language characteristics of the voice data to be recognized, and the attribute information may also be a collective term for several kinds of attribute information. For example, for agglutinative languages such as the one above, the attribute information can be the part of speech, the type of the sub-word, and so on; for other languages it can include sentence component information describing how the language elements form sentences. In short, the attribute information is the attribute information that distinguishes one language element from another in one or more dimensions and that specifies the composition rules when language elements form whole words or sentences, i.e., attribute information that affects language composition.
In the embodiments of the present application, the specific processing of the provided voice recognition method is described by taking voice recognition of agglutinative speech as an example; because the attribute dimensions of language elements differ across languages, the specific attribute information obtained by voice recognition differs across languages as well. It should be understood that the embodiments of the present application are merely exemplary and not restrictive, and that voice recognition in any language implemented with reference to the description of these embodiments falls within the protection scope of the embodiments of the present application.
Specifically, in another embodiment of the present application, it is disclosed that the attribute information includes language component information.
The language component information is used for representing component attributes of the voice content in the whole words and/or sentences formed by the voice content.
For example, a character or word in a Chinese sentence may serve as the subject, predicate, object, etc. of the sentence; a sub-word in an agglutinative word may serve as a stem or an affix of the whole word in which it appears.
It can be understood that determining the language component information of the voice content (characters, words, etc.), i.e., its component attribute when it forms a whole word or sentence, assists the subsequent voice recognition and avoids recognition errors caused by uncertainty about the attributes of the voice content.
Correspondingly, the specific processing procedure for determining the attribute information of the voice content of the voice data to be recognized according to the voice feature of the voice data to be recognized is as follows:
and inputting the voice characteristics into a pre-trained first decoding model, and decoding to obtain language component information of voice content of the voice data to be recognized.
The first decoding model is obtained by training at least according to the decoding training characteristic sample of the voice component information marked with the content to be recognized.
Specifically, the decoding training feature sample refers to a feature sample used for decoding training of the first decoding model, and the feature sample is specifically a feature obtained by performing feature extraction on training sample data. Illustratively, the decoded training feature samples may be represented in the form of feature vectors.
The content to be identified is the specific content in the data to be identified corresponding to the decoding training feature sample. When the decoding training characteristic sample is a characteristic sample of voice data, the content to be recognized is the voice content of the voice data; or, when the decoding training feature sample is a feature sample of text data, the content to be recognized is the text content of the text data.
On the basis that the decoding training feature samples are labeled with the language component information of the content to be recognized, the decoding training feature samples are input into the first decoding model, so that the first decoding model performs language component decoding on them and determines the language component information of the voice content corresponding to the training feature samples.
The language component information obtained by the decoding of the first decoding model is then compared with the labeled language component information of the content to be recognized, and the parameters of the first decoding model are corrected backward according to the difference between the two.
And repeating the identification and the reverse parameter correction processing until the identification accuracy of the first decoding model reaches a preset requirement, and stopping training.
The trained first decoding model can be used for decoding the language component information of the voice content of the voice data to be recognized and determining the language component information of the voice content of the voice data to be recognized.
For example, suppose the voice data to be recognized is voice data of a person saying "abcdef". Inputting the voice features of this voice data into the trained first decoding model recognizes the language component information of each sub-word contained in the voice data, i.e., determines whether each sub-word is a stem or an affix. Suppose recognition determines that the first sub-word is a stem, the second an affix and the third a stem; the first decoding model then outputs the recognition result "stem affix stem", meaning that the first sub-word of the input voice data to be recognized is a stem, the second an affix and the third a stem.
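A minimal sketch of such a first decoding model, assumed here to be a small recurrent decoder that emits one language component label per sub-word position; the architecture and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

COMPONENT_LABELS = ["stem", "affix"]

class LanguageComponentDecoder(nn.Module):
    """Decode per-sub-word stem/affix labels from encoded speech features."""

    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classify = nn.Linear(hidden, len(COMPONENT_LABELS))

    def forward(self, speech_features):
        # speech_features: (batch, num_subwords, feat_dim) attention contexts
        hidden, _ = self.rnn(speech_features)
        return self.classify(hidden)            # (batch, num_subwords, 2)

decoder = LanguageComponentDecoder()
logits = decoder(torch.randn(1, 3, 256))
print([COMPONENT_LABELS[i] for i in logits.argmax(-1)[0].tolist()])
# e.g. ['stem', 'affix', 'stem'] once the model has been trained
```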
Further, another embodiment of the present application further discloses that the attribute information further includes part-of-speech type information on the basis of including language component information.
The part-of-speech type information indicates the part of speech of the voice content of the voice data to be recognized; for example, when the voice data to be recognized is agglutinative speech data, it may indicate nouns, verbs and so on.
Correspondingly, as shown in fig. 4, on the basis of executing step S402, i.e., inputting the voice features into the pre-trained first decoding model and decoding to obtain the language component information of the voice content of the voice data to be recognized, determining the attribute information of the voice content of the voice data to be recognized according to the voice features further includes:
S403, inputting the voice features and the language component information of the voice content of the voice data to be recognized into a pre-trained second decoding model, and decoding to obtain the part-of-speech type information of the voice content of the voice data to be recognized.
And the second decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
Specifically, with reference to the training process of the first decoding model introduced in the foregoing embodiment, the embodiment of the present application trains the second decoding model with decoding training feature samples labeled with the language component information and part-of-speech type information of the content to be recognized, so that, given the language component information of the voice content of the voice data to be recognized, the second decoding model can further decode the voice features and obtain the part-of-speech type information of the voice content by decoding.
As shown in fig. 5, after the voice features of the voice data to be recognized are input into the pre-trained first decoding model and the language component information of the voice content is obtained by decoding, the output of the first decoding model and the voice features of the voice data to be recognized are used together as the input of the trained second decoding model. Given the language component information of the voice content, the second decoding model further decodes the voice features and obtains the part-of-speech type information of the voice content of the voice data to be recognized.
For example, suppose the voice data to be recognized is voice data of a person saying "abcdef". Inputting the voice features of this voice data into the trained first decoding model recognizes the language component information of each sub-word contained in the voice data. Suppose the first decoding model determines by decoding that the first sub-word is a stem, the second an affix and the third a stem; it then outputs the recognition result "stem affix stem", meaning that the first sub-word of the input voice data to be recognized is a stem, the second an affix and the third a stem.
Then, referring to the decoding model architecture shown in fig. 5, the output of the first decoding model and the voice features of the voice data to be recognized are used together as the input of the second decoding model, which further decodes the part-of-speech types of the voice content, i.e., the part-of-speech types of the three sub-words contained in the voice data. Suppose the second decoding model determines by decoding that the first sub-word, a stem, is a noun-type sub-word; the first sub-word of the voice data to be recognized is then determined to be a noun stem. Suppose it determines that the second sub-word, an affix, is a case-type sub-word; the second sub-word is then determined to be a case affix. Suppose it determines that the third sub-word, a stem, is a verb-type sub-word; the third sub-word is then determined to be a verb stem. The second decoding model then outputs "noun stem, case affix, verb stem", meaning that the first sub-word of the voice data to be recognized is a stem of the noun type, the second an affix of the case type and the third a stem of the verb type.
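A minimal sketch of the cascade shown in fig. 5, in which a second decoder consumes both the speech features and the first decoder's component labels to predict part-of-speech types; all module names, sizes and label sets are illustrative assumptions.

```python
import torch
import torch.nn as nn

POS_LABELS = ["noun", "verb", "case", "other"]

class PartOfSpeechDecoder(nn.Module):
    """Second decoding model: speech features + component labels -> POS types."""

    def __init__(self, feat_dim=256, num_components=2, hidden=128):
        super().__init__()
        self.component_emb = nn.Embedding(num_components, 32)
        self.rnn = nn.GRU(feat_dim + 32, hidden, batch_first=True)
        self.classify = nn.Linear(hidden, len(POS_LABELS))

    def forward(self, speech_features, component_ids):
        # speech_features: (batch, num_subwords, feat_dim)
        # component_ids:   (batch, num_subwords), 0 = stem, 1 = affix
        x = torch.cat([speech_features, self.component_emb(component_ids)], dim=-1)
        hidden, _ = self.rnn(x)
        return self.classify(hidden)     # (batch, num_subwords, len(POS_LABELS))

# Cascade: the first decoding model's output conditions the second decoding model.
component_ids = torch.tensor([[0, 1, 0]])                 # "stem affix stem"
pos_logits = PartOfSpeechDecoder()(torch.randn(1, 3, 256), component_ids)
```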
After the attribute information consisting of the language component information and the part-of-speech type information of the voice content of the voice data to be recognized has been determined through steps S402 and S403, step S404 is executed: determining the voice content of the voice data to be recognized according to the voice features and the attribute information of the voice content, i.e., according to the voice features together with the language component information and part-of-speech type information of the voice content of the voice data to be recognized.
The specific working contents of steps S401 and S404 in the method embodiment shown in fig. 4 can refer to steps S101 and S103 in the method embodiment shown in fig. 1, and are not described herein again.
The above embodiment describes that, when the above attribute information includes two-aspect contents, namely, language component information and part-of-speech type information, the two-aspect contents are recognized by using the first decoding model and the second decoding model trained in advance, respectively.
It is to be understood that, once the specific content included in the attribute information is determined, that content may also be identified through a single decoding model.
Illustratively, another embodiment of the present application discloses that, when the attribute information includes both language component information and part-of-speech type information, the determining the attribute information of the speech content of the speech data to be recognized according to the speech features of the speech data to be recognized includes:
and inputting the voice characteristics into a pre-trained third decoding model, and decoding to obtain language component information and part of speech type information of voice content of the voice data to be recognized.
The third decoding model is obtained by training at least according to decoding training feature samples labeled with the language component information and the part-of-speech type information of the content to be recognized.
The decoding training feature sample refers to a feature sample used for performing decoding training on the third decoding model, and the feature sample is specifically a feature obtained by performing feature extraction on training sample data. Illustratively, the decoded training feature samples may be represented in the form of feature vectors.
The content to be identified is the specific content in the data to be identified corresponding to the decoding training feature sample. When the decoding training characteristic sample is a characteristic sample of voice data, the content to be recognized is the voice content of the voice data; or, when the decoding training feature sample is a feature sample of text data, the content to be recognized is the text content of the text data.
Given that the decoding training feature samples are labeled with the language component information and the part-of-speech type information of the content to be recognized, the decoding training feature samples are input into the third decoding model, so that the third decoding model performs language component decoding and part-of-speech type decoding on them and determines the language component information and the part-of-speech type information of the content corresponding to the decoding training feature samples.
And then comparing the language component information and the part of speech type information of the voice content obtained by decoding the third decoding model with the language component information and the part of speech type information of the marked content to be recognized, and performing reverse parameter correction on the third decoding model according to the difference between the language component information and the part of speech type information.
And repeating the identification and the reverse parameter correction until the identification accuracy of the third decoding model reaches the preset requirement, and stopping training.
The trained third decoding model can be used for decoding and recognizing the language component information and the part of speech type information of the voice content of the voice data to be recognized, and determining the language component information and the part of speech type information of the voice content of the voice data to be recognized.
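The training procedure described above can be sketched roughly as follows; the model interface, data loader and accuracy criterion are assumptions introduced for illustration only, not the implementation of the present application.

```python
import torch
import torch.nn as nn

def train_third_decoder(model, loader, target_acc=0.95, lr=1e-3, max_epochs=50):
    """Assumed interface: model(feats) -> (component_logits, pos_logits) per subword."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        correct, total = 0, 0
        for feats, comp_labels, pos_labels in loader:        # labeled training feature samples
            comp_logits, pos_logits = model(feats)
            # the difference between the decoded labels and the annotations drives
            # the reverse parameter correction (backpropagation)
            loss = ce(comp_logits.flatten(0, 1), comp_labels.flatten()) \
                 + ce(pos_logits.flatten(0, 1), pos_labels.flatten())
            opt.zero_grad(); loss.backward(); opt.step()
            correct += (comp_logits.argmax(-1) == comp_labels).sum().item()
            total += comp_labels.numel()
        if correct / total >= target_acc:                     # preset accuracy requirement reached
            break
```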
For example, assuming that the speech data to be recognized is speech data of the spoken utterance "abcdef", the speech features of the speech data to be recognized are input into the trained third decoding model, so that the language component information and the part-of-speech type information of each subword included in the speech data to be recognized can be recognized; that is, it is determined whether each subword is a stem or an affix, and whether each subword is of a noun type, a verb type, or the like. Assuming that recognition determines that the first subword of the speech data to be recognized is a stem of the noun type, the second subword is an affix of the lattice type, and the third subword is a stem of the verb type, the third decoding model outputs the recognition result "noun stem, lattice affix, verb stem".
The above embodiments respectively describe the specific processing procedures for determining the attribute information of the speech content of the speech data to be recognized, taking as examples the cases where the attribute information includes language component information, where it further includes part-of-speech type information, and where both are decoded at the same time. Furthermore, the above embodiments of the present application determine the attribute information of the speech content of the speech data to be recognized by means of a decoding model.
It should be noted that the above embodiments are merely exemplary implementations; when the specific content included in the attribute information differs, the specific function of each decoding model changes accordingly. Moreover, the decoding models may exist independently, in a decoding model architecture similar to that shown in fig. 5, or may exist as different functional components of one overall decoding model; when the technical solution of the embodiments of the present application is applied, the form in which the decoding models exist may be set flexibly. It is understood that, no matter how the specific content of the attribute information changes or how the decoding models described in the above embodiments relate to one another, the speech recognition method proposed in the present application can be implemented by referring to the descriptions of the above embodiments, and all such implementations fall within the scope of the embodiments of the present application.
From the above embodiments of the present application, it can be seen that, when the decoding of the attribute information is implemented with the above decoding models and the speech content is decoded according to the attribute information of the speech content, the training of each decoding model is critical: only through sufficient training does a decoding model acquire a decoding capability that meets the requirements. In the embodiments of the present application, each of the above decoding models is trained using decoding training feature samples.
Since the speech recognition method provided by the embodiment of the application aims to recognize the speech data to be recognized more accurately, features of speech data samples are preferentially taken as the decoding training feature samples for training the decoding model.
Further, in order to improve the decoding capability of the decoding model, another embodiment of the present application further discloses that the decoding training feature samples include speech feature samples and text feature samples. That is, in addition to training the decoding model with features of speech data samples, features of text samples are also used as training samples for training the decoding model's capability to decode the attribute information and the speech content.
Referring to fig. 6, the embodiment of the present application further discloses that the voice feature sample is obtained by performing voice feature extraction on a voice training sample by a preset voice encoder 31, and the text feature sample is obtained by performing text feature extraction on a text training sample by a preset text encoder 32.
That is, in the embodiment of the present application, the speech encoder 31 and the text encoder 32 shown in fig. 6 are respectively configured to perform feature extraction on a speech training sample to obtain a speech feature sample, and perform feature extraction on a text training sample to obtain a text feature sample, where the obtained speech feature sample and the obtained text feature sample are simultaneously used as decoding training feature samples of a decoding model.
The speech encoder 31 shown in fig. 6 is obtained at least by performing feature extraction training on speech data samples. Similarly, the text encoder 32 is obtained at least by performing feature extraction training on text data samples.
The specific structures of the speech encoder 31 and the text encoder 32 may be the same as the structure of the encoding model shown in fig. 3; that is, each may include multiple encoding layers and an attention module. Each encoding layer performs one layer of encoding processing on the speech data X or the text data C; the attention module forms attention weights in combination with the decoding state of the decoding model 2 at the previous time, and the attention weights and the feature vectors output by the last encoding layer are then weighted and summed along the vector dimension to obtain the feature vector of the data at the current time.
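A rough sketch of such an encoder is given below, assuming recurrent encoding layers and a dot-product attention over the last layer's output; the concrete layer types and dimensions are assumptions, since the present application does not fix them.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch: stacked encoding layers plus an attention module driven by the previous decoder state."""
    def __init__(self, in_dim, hidden, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.GRU(in_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(n_layers))
        self.query = nn.Linear(hidden, hidden)          # maps the previous decoder state to a query

    def forward(self, x, prev_dec_state):               # x: (batch, time, in_dim); prev_dec_state: (batch, hidden)
        for layer in self.layers:                       # layer-by-layer encoding processing
            x, _ = layer(x)
        # attention weights formed from the decoding state at the previous time step
        scores = torch.bmm(x, self.query(prev_dec_state).unsqueeze(-1)).squeeze(-1)
        weights = scores.softmax(dim=-1)                 # (batch, time)
        # weighted sum over the last encoding layer's output, per vector dimension
        context = torch.bmm(weights.unsqueeze(1), x).squeeze(1)   # (batch, hidden)
        return x, context
```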
On the other hand, as shown in fig. 6, the decoding model 2 that implements attribute information recognition and speech content recognition in the speech recognition method proposed by the present application is limited to learning within a single feature domain at a time. That is, the decoding model can only learn the decoding capability for a particular type of feature, and when it then learns the decoding capability for another type of feature, the previously learned decoding capability is lost. For example, assuming that the same decoding model has been trained to obtain a speech feature decoding capability, when the decoding model is then trained to decode text features, it will gradually learn the decoding capability for text features and, because text features and speech features are represented differently, gradually lose the previously acquired speech feature decoding capability.
In view of this characteristic of the decoding model 2, in the embodiment of the present application, when the speech encoder 31 and the text encoder 32 are used together for training the decoding model 2, the speech encoder 31 and the text encoder 32 are subjected to joint training processing so that the speech features output by the speech encoder 31 and the text features output by the text encoder 32 have the same feature characterization distribution. The speech features and the text features with the same feature characterization distribution output by the speech encoder 31 and the text encoder 32 are then both used as training data for the subsequent training of the decoding model 2.
The speech features output by the speech encoder 31 are the same as the feature characterization distributions of the text features output by the text encoder 32, that is, the speech features and the text features conform to the same feature distributions. For example, when the speech feature and the text feature are both represented in the form of feature vectors, assuming that the features represented by the speech feature vectors conform to a gaussian distribution, when the features represented by the text feature vectors also conform to a gaussian distribution, the speech feature vectors are the same as the feature characterization distribution of the text feature vectors.
The above-mentioned joint training process can make the speech encoder 31 and the text encoder 32 perform feature encoding on the speech training samples and the text training samples in the same encoding vector space, and the feature characterization distributions of the speech features and the text features obtained by final encoding are the same.
Inputting feature samples with the same feature characterization distribution into the decoding model 2 as decoding training feature samples for training avoids confusing the decoding model 2 with features of different characterization distributions, so the training effect improves steadily.
In another embodiment of the present application, a specific process of performing the joint training process for the speech encoder 31 and the text encoder 32 is disclosed.
Referring to fig. 7, the embodiment of the present application implements the joint training process for the speech encoder 31 and the text encoder 32 by means of the pre-trained discriminator model 40.
The discriminator model 40 is trained at least by recognizing text feature samples and speech feature samples. Specifically, preset text feature samples and preset speech feature samples (which may be speech features output by the speech encoder 31 and text features output by the text encoder 32) are respectively input into the discriminator model 40 for training, so that the discriminator model 40 learns to distinguish text feature samples from speech feature samples; when the discriminator model 40 can accurately distinguish any input text feature sample from any input speech feature sample, the operating parameters of the discriminator model 40 are fixed. At this point, the training process of the discriminator model 40 is complete, and it has the ability to accurately distinguish text features from speech features.
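A minimal sketch of this pre-training stage is given below, under the assumption that the discriminator is a small feed-forward binary classifier over pooled 256-dimensional feature vectors; the dimensions and optimizer settings are illustrative, not prescribed by the present application.

```python
import torch
import torch.nn as nn

discriminator = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 1))                                   # logit: "this is a text feature"

def pretrain_discriminator(speech_feats, text_feats, steps=1000, lr=1e-3):
    """speech_feats / text_feats: (N, 256) pooled feature vectors (assumed shapes)."""
    opt = torch.optim.Adam(discriminator.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        x = torch.cat([speech_feats, text_feats])
        y = torch.cat([torch.zeros(len(speech_feats), 1),
                       torch.ones(len(text_feats), 1)])
        loss = bce(discriminator(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    for p in discriminator.parameters():                 # fix the operating parameters after pre-training
        p.requires_grad_(False)
```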
The above-described discriminator model 40 is then added to the joint training process for the speech coder 31 and the text coder 32, as shown in fig. 7.
At this time, the speech features output by the speech encoder 31 and the text features output by the text encoder 32 are recognized by the discriminator model 40;
when the discriminator model 40 can distinguish and recognize the speech features output by the speech encoder 31 and the text features output by the text encoder 32, the parameters of the speech encoder 31 and the text encoder 32 are corrected according to the negative cross entropy gradient of the discriminator model 40.
The negative cross entropy gradient of the discriminator model 40 means a gradient of change from a case where the discriminator model 40 can accurately distinguish between the speech feature output by the speech encoder 31 and the text feature output by the text encoder 32 to a case where the speech feature output by the speech encoder 31 and the text feature output by the text encoder 32 cannot be distinguished.
The parameters of the speech encoder 31 and the text encoder 32 are corrected according to the negative cross entropy gradient, and the features output by the speech encoder 31 and the text encoder 32 after the parameter correction are recognized by the discriminator model 40 again. The above feature recognition and parameter correction process is repeated, so that, along its negative cross entropy gradient, the discriminator model 40 loses the ability to distinguish the features output by the speech encoder 31 from those output by the text encoder 32 as quickly as possible. At this point, the feature characterization distribution of the speech features output by the speech encoder 31 and that of the text features output by the text encoder 32 can be considered the same, which is why the discriminator model 40 can no longer distinguish them.
As an optional implementation manner, when the pre-trained discriminator model 40 is used to respectively recognize the speech features output by the speech encoder 31 and the text features output by the text encoder 32, in order to ensure that the speech features and the text features are comparable, as shown in fig. 7, in the embodiment of the present application the speech features output by the speech encoder 31 and the text features output by the text encoder 32 are each average-pooled before being distinguished by the discriminator model 40. Illustratively, the previously trained discriminator model 40 recognizes the average-pooled vector of the speech feature vectors output by the speech encoder 31 and the average-pooled vector of the text feature vectors output by the text encoder 32.
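For instance, the average pooling can be taken over the time axis of each encoder's output, as in the following sketch; the tensor shapes are assumptions for illustration.

```python
import torch

def avg_pool_time(features: torch.Tensor) -> torch.Tensor:
    """(batch, time, dim) encoder output -> (batch, dim) average-pooled vector."""
    return features.mean(dim=1)

speech_vec = avg_pool_time(torch.randn(8, 120, 256))    # assumed speech encoder 31 output
text_vec = avg_pool_time(torch.randn(8, 40, 256))       # assumed text encoder 32 output
# speech_vec and text_vec are then fed to the discriminator model 40
```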
Further, in order to ensure that the features output by the trained speech encoder 31 and text encoder 32 can be decoded efficiently by the decoding model 2 shown in fig. 7, thereby speeding up the training process of the decoding model 2 and improving the training effect, in the embodiment of the present application the gradient of the decoding model 2 is also used as a reference when correcting the parameters of the speech encoder 31 and the text encoder 32 according to the recognition result of the discriminator model 40; that is, the parameters of the speech encoder 31 and the text encoder 32 are simultaneously corrected according to the gradient of the decoding model 2. In this way, after the parameters of the speech encoder 31 and the text encoder 32 are corrected, the feature characterization distributions of the features they output are the same, and at the same time the features they output can be decoded by the decoding model 2 more efficiently, which improves the model training effect and further benefits the speech recognition efficiency of the model.
Specifically, let the loss function of the decoding model 2 at this time be $L_{D1}$, which includes a speech data decoding loss and a text data decoding loss: the cross entropy loss of the decoding model 2 generated by decoding speech data is $L_{D1S}$, and the cross entropy loss generated by decoding text data is $L_{D1T}$. Let the cross entropy loss of the discriminator model 40 be $L_{D2}$. The update strategy for the network parameters of each part of the speech recognition system shown in fig. 7 can then be written as:

$$\theta_{D1} \leftarrow \theta_{D1} - l\,\frac{\partial (L_{D1S}+L_{D1T})}{\partial \theta_{D1}}$$

$$\theta_{D2} \leftarrow \theta_{D2} - l\,\frac{\partial L_{D2}}{\partial \theta_{D2}}$$

$$\theta_{E1} \leftarrow \theta_{E1} - l\left(\frac{\partial L_{D1S}}{\partial \theta_{E1}} - \frac{\partial L_{D2}}{\partial \theta_{E1}}\right)$$

$$\theta_{E2} \leftarrow \theta_{E2} - l\left(\frac{\partial L_{D1T}}{\partial \theta_{E2}} - \frac{\partial L_{D2}}{\partial \theta_{E2}}\right)$$

where $\theta_{E1}$ is the network parameters of the speech encoder 31, $\theta_{E2}$ is the network parameters of the text encoder 32, $\theta_{D1}$ is the network parameters of the decoding model 2, $\theta_{D2}$ is the network parameters of the discriminator model 40, and $l$ is the learning rate.

Each network parameter is corrected in the joint training process according to the above update strategy until $L_{D1S}$ and $L_{D1T}$ no longer change, at which point training stops.
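A possible single joint-training step consistent with the above update strategy is sketched below. The model interfaces (`decoder.loss`, `disc.loss`) and the exact composition of the batches are assumptions introduced only for illustration; the updates are plain gradient steps with learning rate l, with the encoders ascending the discriminator loss (negative cross entropy gradient) while descending their decoding losses.

```python
import torch

def joint_step(speech_enc, text_enc, decoder, disc, speech_batch, text_batch, l=1e-3):
    s_feat = speech_enc(speech_batch["audio"])            # speech features (assumed batch layout)
    t_feat = text_enc(text_batch["text"])                 # text features
    L_D1S = decoder.loss(s_feat, speech_batch["labels"])  # speech data decoding loss
    L_D1T = decoder.loss(t_feat, text_batch["labels"])    # text data decoding loss
    L_D2 = disc.loss(s_feat, t_feat)                      # discriminator cross entropy loss

    # decoder and discriminator descend their own losses; the encoders descend the
    # decoding loss while ascending L_D2 (i.e. following the negative cross entropy gradient)
    grads_dec = torch.autograd.grad(L_D1S + L_D1T, list(decoder.parameters()), retain_graph=True)
    grads_dsc = torch.autograd.grad(L_D2, list(disc.parameters()), retain_graph=True)
    grads_se = torch.autograd.grad(L_D1S - L_D2, list(speech_enc.parameters()), retain_graph=True)
    grads_te = torch.autograd.grad(L_D1T - L_D2, list(text_enc.parameters()))

    with torch.no_grad():
        for params, grads in ((decoder.parameters(), grads_dec),
                              (disc.parameters(), grads_dsc),
                              (speech_enc.parameters(), grads_se),
                              (text_enc.parameters(), grads_te)):
            for p, g in zip(params, grads):
                p -= l * g                                 # plain gradient step with learning rate l
    return L_D1S.item(), L_D1T.item()                      # training stops when these no longer change
```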
Through the above-described joint training process, the feature characterization distributions of the features output by the speech encoder 31 and the text encoder 32 shown in fig. 7 gradually tend to be the same, and at this time, the features output by the speech encoder 31 and the text encoder 32 are continuously used for training the decoding model 2, so that the functional training of the decoding model 2 is gradually completed.
After the training of the decoding model 2 is completed, the text encoder 32 and the discriminator model 40 in the speech recognition system shown in fig. 7 may be omitted, and only the speech encoder 31 and the decoding model 2 after the joint training with the text encoder 32 are retained to form a speech recognition system model. In this case, the speech encoder 31 in the speech recognition system corresponds to the coding model 1 shown in fig. 2.
Correspondingly, when step S101 of the speech recognition method shown in fig. 1 provided by the embodiment of the present application is executed to extract the speech features of the speech data to be recognized, the speech data to be recognized is input into the speech encoder 31 that has undergone the joint training processing with the text encoder 32, and the speech features of the speech data to be recognized are extracted.
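At inference time the retained components can thus be composed as in the following sketch; the function names and call signatures are assumptions for illustration rather than the reference implementation.

```python
def recognize(audio, speech_enc, attr_decoder, content_decoder):
    feats = speech_enc(audio)              # extract speech features with the jointly trained speech encoder 31
    attrs = attr_decoder(feats)            # determine attribute information of the speech content
    return content_decoder(feats, attrs)   # determine the speech content from features plus attribute information
```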
It is to be understood that, since the above embodiments of the present application describe that the decoding model 2 shown in fig. 2 is trained based on the jointly trained speech encoder 31 and text encoder 32, when the decoding model 2 is applied to speech recognition it can only process features whose feature characterization distribution is the same as that of the features output by those encoders. Therefore, in the embodiment of the present application the speech encoder 31 used for training the decoding model 2 is directly used as the encoding model of the whole speech recognition system, which ensures that the speech recognition system functions correctly.
Corresponding to the above speech recognition method, an embodiment of the present application further provides a speech recognition apparatus, as shown in fig. 8, the apparatus includes:
a feature extraction unit 100 for extracting a voice feature of voice data to be recognized;
an attribute determining unit 110, configured to determine attribute information of the voice content of the voice data to be recognized according to the voice feature;
a content identification unit 120, configured to determine the voice content of the voice data to be identified according to the voice feature and the attribute information of the voice content of the voice data to be identified.
In the voice recognition device provided by the embodiment of the application, the feature extraction unit 100 extracts the voice features of the voice data to be recognized; the attribute determination unit 110 recognizes the attribute information of the voice content of the voice data to be recognized from these voice features; and the content recognition unit 120 then recognizes the voice content of the voice data to be recognized with the attribute information as a reference. The content obtained through this recognition process includes both the attribute information of the voice content and the specific information of the voice content, so the phenomenon of recognition confusion caused by failing to distinguish the attributes of the voice content can be effectively avoided, and the voice recognition accuracy can be improved.
Optionally, in another embodiment of the present application, it is proposed that the attribute information includes language component information, where the language component information indicates component attributes of the speech content in the whole words and/or sentences formed by the speech content;
correspondingly, when the attribute determining unit 110 determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is specifically configured to:
inputting the voice features into a pre-trained first decoding model, and decoding to obtain language component information of voice content of the voice data to be recognized; the first decoding model is obtained by training at least according to the decoding training characteristic sample marked with the language component information of the content to be recognized.
Optionally, in another embodiment of the present application, it is provided that the attribute information further includes part-of-speech type information;
correspondingly, when the attribute determining unit 110 determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is further configured to:
inputting the voice characteristics and the language component information of the voice content of the voice data to be recognized into a pre-trained second decoding model, and decoding to obtain the part-of-speech type information of the voice content of the voice data to be recognized; and the second decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
Optionally, in another embodiment of the present application, it is provided that the attribute information includes language component information and part of speech type information; the language component information represents the component attributes of the voice content in the whole words and/or sentences formed by the voice content;
correspondingly, when the attribute determining unit 110 determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is specifically configured to:
inputting the voice characteristics into a pre-trained third decoding model, and decoding to obtain language component information and part of speech type information of voice contents of the voice data to be recognized; and the third decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
As an optional implementation manner, when determining the voice content of the voice data to be recognized according to the voice feature and the attribute information of the voice content of the voice data to be recognized, the content recognition unit 120 is specifically configured to:
inputting the voice characteristics and attribute information of the voice content of the voice data to be recognized into a pre-trained fourth decoding model, and decoding to obtain the voice content of the voice data to be recognized; and the fourth decoding model is obtained by training at least according to the decoding training characteristic sample marked with the content to be recognized and the attribute information of the content to be recognized.
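A hedged sketch of how the content recognition unit 120 might invoke such a fourth decoding model is given below; the module interface and dimensions are assumptions, not the reference implementation of the present application.

```python
import torch
import torch.nn as nn

class ContentDecoder(nn.Module):
    """Fourth decoding model (sketch): speech features + attribute information -> speech content."""
    def __init__(self, feat_dim, attr_dim, hidden, vocab_size):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + attr_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, speech_feats, attr_info):
        h, _ = self.rnn(torch.cat([speech_feats, attr_info], dim=-1))
        return self.out(h)                   # subword logits per time step
```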
In an alternative implementation, the decoding training feature samples include speech feature samples and text feature samples.
As an optional implementation manner, the speech feature sample is obtained by performing speech feature extraction on a speech training sample by using a preset speech encoder; the text feature sample is obtained by extracting text features of a text training sample by a preset text encoder;
the speech coder is obtained at least by performing feature extraction training on speech data samples, the text coder is obtained at least by performing feature extraction training on text data samples, and the speech coder and the text coder are subjected to joint training processing, so that the feature characterization distribution of the speech features output by the speech coder is the same as that of the text features output by the text coder.
As an optional implementation manner, the joint training process of the speech encoder and the text encoder includes:
respectively identifying the text features output by the text encoder and the voice features output by the voice encoder by utilizing a pre-trained discriminator model; wherein the discriminator model is obtained by training at least through recognizing text characteristic samples and voice characteristic samples;
When the discriminator model can distinguish and recognize the text features output by the text encoder and the voice features output by the voice encoder, correcting the parameters of the text encoder and the voice encoder according to the negative cross entropy gradient of the discriminator model;
the above process is repeated until the discriminator model cannot distinguish between recognizing the text features output by the text encoder and the speech features output by the speech encoder.
As an optional implementation manner, the separately recognizing the text feature output by the text encoder and the speech feature output by the speech encoder by using the pre-trained discriminator model includes:
and respectively identifying the average pooling vector of the text characteristic vector output by the text encoder and the average pooling vector of the voice characteristic vector output by the voice encoder by utilizing a pre-trained discriminator model.
As an optional implementation manner, the extracting the speech feature of the speech to be recognized includes:
and inputting the voice data to be recognized into the voice coder that has undergone the joint training processing with the text coder, and extracting the voice characteristics of the voice data to be recognized.
Another embodiment of the present application further provides a speech recognition apparatus, as shown in fig. 9, the apparatus including:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to implement the following functions by running the program stored in the memory 200:
extracting voice features of voice data to be recognized; determining attribute information of voice content of the voice data to be recognized according to the voice characteristics; and determining the voice content of the voice data to be recognized according to the voice characteristics and the attribute information of the voice content of the voice data to be recognized.
Specifically, the voice recognition device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may comprise a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose Central Processing Unit (CPU) or microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present invention. The processor may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores programs for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer operating instructions. More specifically, memory 200 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk storage, a flash, and so forth.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
Communication interface 220 may include any means for using a transceiver or the like to communicate with other devices or communication networks, such as ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The processor 210 executes the programs stored in the memory 200 and invokes other devices, which may be used to implement the steps of the speech recognition method provided by the embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the speech recognition method provided in any of the above embodiments.
While, for purposes of simplicity of explanation, the foregoing method embodiments are presented as a series of acts or combinations, it will be appreciated by those of ordinary skill in the art that the present application is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of the embodiments of the present application may be sequentially adjusted, combined, and deleted according to actual needs.
The modules and units in the device and the terminal in the embodiments of the present application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, a division of a module or a unit is only one logical division, and an actual implementation may have another division, for example, a plurality of units or modules may be combined or integrated into another module or unit, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules or units described as separate parts may or may not be physically separate, and parts that are modules or units may or may not be physical modules or units, may be located in one place, or may be distributed on a plurality of network modules or units. Some or all of the modules or units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A speech recognition method, comprising:
inputting voice data to be recognized into a voice coder that has undergone joint training processing with a text coder, and extracting voice characteristics of the voice data to be recognized; wherein the feature characterization distribution of the voice features is the same as that of the text features output by the text coder;
determining attribute information of voice content of the voice data to be recognized according to the voice characteristics;
and determining the voice content of the voice data to be recognized according to the voice characteristics and the attribute information of the voice content of the voice data to be recognized.
2. The method according to claim 1, wherein the attribute information includes language component information representing component attributes of the speech content in the whole words and/or sentences composed thereof;
correspondingly, the determining the attribute information of the voice content of the voice data to be recognized according to the voice feature includes:
inputting the voice features into a pre-trained first decoding model, and decoding to obtain language component information of voice content of the voice data to be recognized; the first decoding model is obtained by training at least according to the decoding training characteristic sample marked with the language component information of the content to be recognized.
3. The method of claim 2, wherein the attribute information further includes part-of-speech type information;
correspondingly, the determining the attribute information of the voice content of the voice data to be recognized according to the voice feature further includes:
inputting the voice characteristics and the language component information of the voice content of the voice data to be recognized into a pre-trained second decoding model, and decoding to obtain the part-of-speech type information of the voice content of the voice data to be recognized; and the second decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
4. The method according to claim 1, wherein the attribute information includes language component information and part of speech type information; the language component information represents the component attributes of the voice content in the whole words and/or sentences formed by the voice content;
correspondingly, the determining the attribute information of the voice content of the voice data to be recognized according to the voice feature includes:
inputting the voice characteristics into a pre-trained third decoding model, and decoding to obtain language component information and part of speech type information of voice contents of the voice data to be recognized; and the third decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
5. The method according to claim 1, wherein the determining the voice content of the voice data to be recognized according to the voice feature and the attribute information of the voice content of the voice data to be recognized comprises:
inputting the voice characteristics and attribute information of the voice content of the voice data to be recognized into a pre-trained fourth decoding model, and decoding to obtain the voice content of the voice data to be recognized; and the fourth decoding model is obtained by training at least according to the decoding training characteristic sample marked with the content to be recognized and the attribute information of the content to be recognized.
6. The method of any of claims 2-5, wherein the decoding training feature samples comprise speech feature samples and text feature samples.
7. The method according to claim 6, wherein the speech feature sample is obtained by performing speech feature extraction on a speech training sample by a preset speech coder; the text feature samples are obtained by performing text feature extraction on the text training samples through a preset text encoder;
the speech coder is obtained at least by performing feature extraction training on speech data samples, the text coder is obtained at least by performing feature extraction training on text data samples, and the speech coder and the text coder are subjected to joint training processing, so that the feature characterization distribution of the speech features output by the speech coder is the same as that of the text features output by the text coder.
8. The method of claim 7, wherein the joint training process of the speech coder and the text coder comprises:
respectively identifying the text features output by the text encoder and the voice features output by the voice encoder by utilizing a pre-trained discriminator model; the discriminator model is obtained by training at least through recognizing text characteristic samples and voice characteristic samples;
when the discriminator model can distinguish and identify the text features output by the text encoder and the voice features output by the voice encoder, correcting the parameters of the text encoder and the voice encoder according to the negative cross entropy gradient of the discriminator model;
the above process is repeated until the discriminator model cannot distinguish between recognizing the text features output by the text encoder and the speech features output by the speech encoder.
9. The method of claim 8, wherein the separately recognizing the text features output by the text coder and the speech features output by the speech coder using a pre-trained discriminator model comprises:
and respectively identifying the average pooling vector of the text characteristic vector output by the text encoder and the average pooling vector of the voice characteristic vector output by the voice encoder by utilizing a pre-trained discriminator model.
10. A speech recognition apparatus, comprising:
the feature extraction unit is used for inputting the voice data to be recognized into the voice encoder that has undergone the joint training processing with the text encoder, and extracting the voice features of the voice data to be recognized; wherein the feature characterization distribution of the voice features is the same as that of the text features output by the text encoder;
the attribute determining unit is used for determining the attribute information of the voice content of the voice data to be recognized according to the voice characteristics;
and the content identification unit is used for determining the voice content of the voice data to be identified according to the voice characteristics and the attribute information of the voice content of the voice data to be identified.
11. The apparatus according to claim 10, wherein the attribute information includes language component information indicating component attributes of the speech content in the whole words and/or sentences composed thereof;
correspondingly, when the attribute determining unit determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is specifically configured to:
inputting the voice features into a pre-trained first decoding model, and decoding to obtain language component information of voice content of the voice data to be recognized; the first decoding model is obtained by training at least according to the decoding training characteristic sample marked with the language component information of the content to be recognized.
12. The apparatus of claim 11, wherein the attribute information further comprises part-of-speech type information;
correspondingly, when the attribute determining unit determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is further configured to:
inputting the voice characteristics and the language component information of the voice content of the voice data to be recognized into a pre-trained second decoding model, and decoding to obtain the part-of-speech type information of the voice content of the voice data to be recognized; and the second decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
13. The apparatus according to claim 10, wherein the attribute information includes language component information and part of speech type information; the language component information represents the component attribute of the voice content in the whole words and/or sentences formed by the voice content;
correspondingly, when the attribute determining unit determines the attribute information of the voice content of the voice data to be recognized according to the voice feature, the attribute determining unit is specifically configured to:
inputting the voice characteristics into a pre-trained third decoding model, and decoding to obtain language component information and part of speech type information of voice contents of the voice data to be recognized; and the third decoding model is obtained by training at least according to the decoding training characteristic sample labeled with the language component information and the part of speech type information of the content to be recognized.
14. The apparatus according to claim 10, wherein the content recognition unit, when determining the voice content of the voice data to be recognized according to the voice feature and the attribute information of the voice content of the voice data to be recognized, is specifically configured to:
inputting the voice features and attribute information of the voice content of the voice data to be recognized into a pre-trained fourth decoding model, and decoding to obtain the voice content of the voice data to be recognized; and the fourth decoding model is obtained by training at least according to the decoding training characteristic sample marked with the content to be recognized and the attribute information of the content to be recognized.
15. A speech recognition device, comprising:
a memory and a processor;
wherein the memory is connected with the processor and used for storing programs;
the processor is used for realizing the following functions by running the program stored in the memory:
inputting voice data to be recognized into a voice coder that has undergone joint training processing with a text coder, and extracting voice characteristics of the voice data to be recognized; wherein the feature characterization distribution of the voice features is the same as that of the text features output by the text coder; determining attribute information of voice content of the voice data to be recognized according to the voice characteristics; and determining the voice content of the voice data to be recognized according to the voice characteristics and the attribute information of the voice content of the voice data to be recognized.
16. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the speech recognition method according to one of claims 1 to 9.
CN201811556515.4A 2018-12-19 2018-12-19 Voice recognition method, device, equipment and storage medium Active CN109461438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811556515.4A CN109461438B (en) 2018-12-19 2018-12-19 Voice recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811556515.4A CN109461438B (en) 2018-12-19 2018-12-19 Voice recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109461438A CN109461438A (en) 2019-03-12
CN109461438B true CN109461438B (en) 2022-06-14

Family

ID=65613822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811556515.4A Active CN109461438B (en) 2018-12-19 2018-12-19 Voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109461438B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862944B (en) * 2019-04-30 2024-04-02 北京嘀嘀无限科技发展有限公司 Speech recognition apparatus, method, electronic device, and computer-readable storage medium
CN111914540A (en) * 2019-05-10 2020-11-10 阿里巴巴集团控股有限公司 Statement identification method and device, storage medium and processor
CN110364142B (en) * 2019-06-28 2022-03-25 腾讯科技(深圳)有限公司 Speech phoneme recognition method and device, storage medium and electronic device
CN112951274A (en) * 2021-02-07 2021-06-11 脸萌有限公司 Voice similarity determination method and device, and program product
CN113689860A (en) * 2021-07-29 2021-11-23 北京捷通华声科技股份有限公司 Training method, device and equipment of voice recognition model and voice recognition method, device and equipment
CN113838456A (en) * 2021-09-28 2021-12-24 科大讯飞股份有限公司 Phoneme extraction method, voice recognition method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101542590A (en) * 2006-11-28 2009-09-23 诺基亚公司 Method, apparatus and computer program product for providing a language based interactive multimedia system
CN105869634A (en) * 2016-03-31 2016-08-17 重庆大学 Field-based method and system for feeding back text error correction after speech recognition
CN107247706A (en) * 2017-06-16 2017-10-13 中国电子技术标准化研究院 Text punctuate method for establishing model, punctuate method, device and computer equipment
CN107679042A (en) * 2017-11-15 2018-02-09 北京灵伴即时智能科技有限公司 A kind of multi-layer dialog analysis method towards Intelligent voice dialog system
US20180254036A1 (en) * 2015-11-06 2018-09-06 Alibaba Group Holding Limited Speech recognition method and apparatus
CN108806671A (en) * 2018-05-29 2018-11-13 杭州认识科技有限公司 Semantic analysis, device and electronic equipment


Also Published As

Publication number Publication date
CN109461438A (en) 2019-03-12

Similar Documents

Publication Publication Date Title
CN109461438B (en) Voice recognition method, device, equipment and storage medium
CN110263322B (en) Audio corpus screening method and device for speech recognition and computer equipment
CN110797016B (en) Voice recognition method and device, electronic equipment and storage medium
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
CN111402895B (en) Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium
CN109119070B (en) Voice endpoint detection method, device, equipment and storage medium
CN109036471B (en) Voice endpoint detection method and device
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
WO2018192186A1 (en) Speech recognition method and apparatus
US20180061417A1 (en) System and method for transcription of spoken words using multilingual mismatched crowd
CN111552777B (en) Audio identification method and device, electronic equipment and storage medium
CN114661881A (en) Event extraction method, device and equipment based on question-answering mode
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN116778967B (en) Multi-mode emotion recognition method and device based on pre-training model
Tran et al. Joint modeling of text and acoustic-prosodic cues for neural parsing
CN112885335A (en) Speech recognition method and related device
CN113553847A (en) Method, device, system and storage medium for parsing address text
CN113160804B (en) Hybrid voice recognition method and device, storage medium and electronic device
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device
CN113936642A (en) Pronunciation dictionary construction method, voice recognition method and related device
CN114283786A (en) Speech recognition method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant