CN115019773A - Speech recognition method and related device, electronic device and storage medium

Info

Publication number
CN115019773A
Authority
CN
China
Prior art keywords
sample
language
speech
voice
sub
Legal status
Pending
Application number
CN202210746650.5A
Other languages
Chinese (zh)
Inventor
方昕
Current Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202210746650.5A
Publication of CN115019773A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech recognition method, a related device, an electronic device and a storage medium. The speech recognition method includes: acquiring the target language to which the speech to be recognized belongs, and acquiring a speech recognition model for each of a plurality of language families. The plurality of language families are obtained by analyzing, based on any one of a plurality of classification modes, the sample subword sequence labeled for each sample speech in a sample speech set, the plurality of classification modes at least including feature clustering of the sample subword sequences; the speech recognition model of each language family is obtained by training on that language family's sample speech subset, and the sample speech subset of each language family is obtained by dividing the sample speech set according to the language families obtained by classification. The speech to be recognized is then recognized based on the speech recognition model corresponding to the language family to which the target language belongs, to obtain the recognition text of the speech to be recognized. This scheme can reduce the application cost of speech recognition models while improving their recognition performance.

Description

Speech recognition method and related device, electronic device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method, a related apparatus, an electronic device, and a storage medium.
Background
With breakthroughs of deep learning technology in the field of speech recognition, speech recognition has been widely applied in education, entertainment, healthcare, transportation and many other industries.
At present, a traditional speech recognition system usually needs to be modeled for each language independently; that is, a speech recognition model must be trained separately for each language, and each model must be deployed and maintained separately, which is very costly. In addition, for some low-resource languages, the recognition performance of an independently modeled speech recognition model is generally poor and cannot meet deployment requirements. In view of this, how to improve the recognition performance of the speech recognition model while reducing its application cost is an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a speech recognition method, a related apparatus, an electronic device and a storage medium, which can improve the recognition performance of a speech recognition model while reducing its application cost.
In order to solve the above technical problem, a first aspect of the present application provides a speech recognition method, including: acquiring the target language to which the speech to be recognized belongs, and acquiring a speech recognition model for each of a plurality of language families, where the plurality of language families are obtained by analyzing, based on any one of a plurality of classification modes, the sample subword sequence labeled for each sample speech in a sample speech set, the plurality of classification modes at least including feature clustering of the sample subword sequences, the speech recognition model of each language family is trained on that language family's sample speech subset, and the sample speech subset of each language family is obtained by dividing the sample speech set according to the language families obtained by classification; and recognizing the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs, to obtain the recognition text of the speech to be recognized.
In order to solve the above technical problem, a second aspect of the present application provides a speech recognition apparatus, including: a language acquisition module, a model acquisition module and a recognition module. The language acquisition module is used to acquire the target language to which the speech to be recognized belongs; the model acquisition module is used to acquire the speech recognition models of a plurality of language families, where the plurality of language families are obtained by analyzing, based on any one of a plurality of classification modes, the sample subword sequence labeled for each sample speech in a sample speech set, the plurality of classification modes at least including feature clustering of the sample subword sequences, the speech recognition model of each language family is trained on that language family's sample speech subset, and the sample speech subset of each language family is obtained by dividing the sample speech set according to the language families obtained by classification; and the recognition module is used to recognize the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs, to obtain the recognition text of the speech to be recognized.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the speech recognition method of the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being for implementing the speech recognition method of the first aspect.
According to the above scheme, the target language to which the speech to be recognized belongs is acquired, along with the speech recognition models of a plurality of language families. The plurality of language families are obtained by analyzing, based on any one of a plurality of classification modes, the sample subword sequence labeled for each sample speech in a sample speech set, and the plurality of classification modes at least include feature clustering of the sample subword sequences; the speech recognition model of each language family is trained on that language family's sample speech subset, which is obtained by dividing the sample speech set according to the language families obtained by classification. On this basis, the speech to be recognized is recognized by the speech recognition model corresponding to the language family to which the target language belongs, to obtain its recognition text. Because the language families are derived by analyzing the labeled sample subword sequences, with feature clustering among the classification modes, languages that can share a model are grouped into the same language family as far as possible. On the one hand, a speech recognition model need not be built independently for each language, which helps reduce the application cost of speech recognition models; on the other hand, because each language family's speech recognition model can learn the information common to the similar languages within the family, the adverse effect of low-resource languages in the family on training is weakened as far as possible, and recognition performance can be improved. Therefore, the application cost of the speech recognition model can be reduced while its recognition performance is improved.
Drawings
FIG. 1 is a schematic flow chart diagram of an embodiment of a speech recognition method of the present application;
FIG. 2 is a block diagram of an embodiment of a speech recognition model;
FIG. 3 is a block diagram of an embodiment of a dynamic language balancing policy;
FIG. 4 is a block diagram of an embodiment of a mask decoding strategy;
FIG. 5 is a block diagram of an embodiment of the speech recognition method of the present application;
FIG. 6 is a block diagram of an embodiment of a speech recognition apparatus according to the present application;
FIG. 7 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 8 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between related objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the related objects before and after it are in an "or" relationship. Further, "plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present application.
Specifically, the method may include the following steps:
Step S11: acquire the target language to which the speech to be recognized belongs, and acquire the speech recognition models of a plurality of language families.
In one implementation scenario, the target language may be determined, before the speech to be recognized is collected, by interacting with the target object that inputs the speech. Specifically, before the speech is collected, a prompt containing various language options may be output, so that the target object can select one of them, thereby determining the target language. For example, the target object may use a terminal device such as a smartphone, a tablet or a voice recorder to collect the speech to be recognized; before the target object clicks "start collecting", a prompt listing language options may be output, the target object may select the language option "English", click "start collecting" after the selection, and click "end collecting" after the collection, thereby obtaining the speech to be recognized, and the target language to which the speech belongs can be determined to be "English". Other cases may be deduced by analogy and are not illustrated here one by one.
In another implementation scenario, different from the foregoing manner of determining the target language by interacting with the target object, the target language may also be obtained by performing language identification on the speech to be recognized after it is collected. The language identification may be performed by a deep-learning-based language identification model, such as a long short-term memory network or a recurrent neural network; the network structure of the language identification model is not limited here. In order to improve the recognition performance of the language identification model, before performing language identification on the speech to be recognized, a plurality of sample speeches may be collected, each labeled with the sample language to which it belongs, and language identification may be performed on each sample speech with the model to obtain its predicted language. On this basis, the network parameters of the language identification model can be adjusted based on the difference between the labeled sample language and the predicted language. Specifically, the difference may be measured with a loss function such as cross entropy, and the parameters may be adjusted with an optimization method such as gradient descent, which are not described again here.
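As an illustration of this training loop, the following is a minimal sketch assuming PyTorch; the class name, feature dimension, number of languages and learning rate are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class LanguageIdModel(nn.Module):
    """LSTM-based language identification model (structure illustrative only)."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_languages=60):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_languages)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        _, (h_n, _) = self.lstm(feats)
        return self.classifier(h_n[-1])       # (batch, num_languages) logits

model = LanguageIdModel()
criterion = nn.CrossEntropyLoss()             # difference between labeled and predicted language
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(feats, sample_language_labels):
    logits = model(feats)                     # predicted language per sample speech
    loss = criterion(logits, sample_language_labels)
    optimizer.zero_grad()
    loss.backward()                           # parameters adjusted by gradient descent
    optimizer.step()
    return loss.item()
```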
In the embodiment of the present disclosure, the plurality of language families may be obtained by analyzing, based on any one of a plurality of classification modes, the sample subword sequence labeled for each sample speech in the sample speech set, where the plurality of classification modes at least include feature clustering of the sample subword sequences.
In one implementation scenario, sample speeches in as many languages as possible may be collected in advance, for example, sample speeches covering the different languages used around the world. In addition, for each language, it can be ensured during collection that the total duration of its sample speech is, as far as possible, no less than a preset threshold (e.g., 100 hours, 200 hours, etc.).
In one implementation scenario, the sample subword sequence may be obtained by segmenting the sample text corresponding to the sample speech into subwords; that is, a sample subword sequence contains several sample subwords. Specifically, a word segmentation tool such as sentence-piece may be used to segment the sample text, which is not limited here. In addition, in order to distinguish the languages to which different subwords belong and to reduce multilingual confusion as much as possible, each sample subword may be labeled with its language. For example, the English text "hello" can be segmented into the subwords h, e, l, l, o; after labeling each subword with the language tag "en", the sample subword sequence of "hello" can be represented as {h_en, e_en, l_en, l_en, o_en}. Of course, to mark the beginning and end of a sample subword sequence, the first sample subword can be uniformly represented as a start character, e.g., <s>, and the last sample subword as an end character, e.g., </s>. In this case, the sample subword sequence of "hello" may be represented as {<s>, h_en, e_en, l_en, l_en, o_en, </s>}. Other cases may be deduced by analogy and are not illustrated here one by one. For convenience of description, the sample subword sequence labeled for a sample speech may be uniformly represented as Y = {y_0, …, y_i, …, y_I}, where y_i denotes the i-th sample subword in the sequence, I+1 is the total number of sample subwords in the sequence, y_0 is the first sample subword (i.e., <s> as mentioned above), and y_I is the last sample subword (i.e., </s>).
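For illustration, a language-tagged sample subword sequence of this form can be produced roughly as follows, assuming the sentencepiece Python package and a character-level segmentation model; the model file name and tag format are assumptions.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="subword.model")  # hypothetical model file

def to_sample_subword_sequence(text, lang):
    subwords = sp.encode(text, out_type=str)         # segment text into subwords
    tagged = [f"{w}_{lang}" for w in subwords]       # label each subword with its language
    return ["<s>"] + tagged + ["</s>"]               # Y = {y_0, ..., y_I}

# With a character-level model, to_sample_subword_sequence("hello", "en") would give
# ['<s>', 'h_en', 'e_en', 'l_en', 'l_en', 'o_en', '</s>']
```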
In an implementation scenario, the semantic features of the sample subwords in the sample subword sequences labeled for the sample speeches in the sample speech set can be obtained first; feature clustering can then be performed on these semantic features to obtain a plurality of feature sets, and for each feature set, a language family can be determined based on the languages of the sample subwords whose semantic features fall in that feature set. In this manner, each language family is determined by feature clustering over the semantic features of the sample subwords, so the language families are divided at the semantic level; this ensures that the different languages within each language family share similar properties, so that in subsequent training the speech recognition model can learn the information common to the similar languages of the same language family.
In a specific implementation scenario, the semantic features of the sample subwords can be extracted with a method such as LaBSE (Language-agnostic BERT Sentence Embedding). For the specific extraction process, reference may be made to the technical details of extraction methods such as LaBSE, which are not repeated here.
In a specific implementation scenario, a clustering method such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) may be used to perform the feature clustering, so as to obtain the plurality of feature sets. For the specific clustering process, reference may be made to the technical details of clustering methods such as DBSCAN, which are not repeated here.
In a specific implementation scenario, for each feature set, the languages to which the sample subwords in the feature set belong may be counted, and the counted languages may be taken as the language family corresponding to the feature set. For example, suppose 3 feature sets are obtained through feature clustering. The languages of the sample subwords whose semantic features fall in the 1st feature set include English, Dutch and German, so the language family corresponding to the 1st feature set can be determined to include these three languages: English, Dutch, German. The languages of the sample subwords in the 2nd feature set include Icelandic, Danish, Norwegian and Swedish, so the language family corresponding to the 2nd feature set includes these four languages: Icelandic, Danish, Norwegian, Swedish. The languages of the sample subwords in the 3rd feature set include Portuguese, Spanish, French and Italian, so the language family corresponding to the 3rd feature set includes these four languages: Portuguese, Spanish, French, Italian. Other cases may be deduced by analogy and are not illustrated here one by one.
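The feature-clustering classification mode can be sketched as follows, under stated assumptions: LaBSE embeddings via the sentence-transformers package, clustering via scikit-learn's DBSCAN, and a toy subword list; the eps and min_samples values are illustrative only.

```python
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

labse = SentenceTransformer("sentence-transformers/LaBSE")

tagged_subwords = ["hus_sv", "huis_nl", "haus_de", "casa_es", "casa_it"]  # toy examples
texts = [w.rsplit("_", 1)[0] for w in tagged_subwords]
langs = [w.rsplit("_", 1)[1] for w in tagged_subwords]

embeddings = labse.encode(texts)                     # semantic feature of each sample subword
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(embeddings)

# Each feature set (cluster) yields a language family from its subwords' languages.
families = defaultdict(set)
for cluster_id, lang in zip(labels, langs):
    if cluster_id != -1:                             # -1 marks DBSCAN noise points
        families[cluster_id].add(lang)
print(dict(families))                                # e.g. {0: {'sv', 'nl', 'de'}, 1: {'es', 'it'}}
```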
In another implementation scenario, different from the classification mode of feature clustering, the language family division may also be performed based on prior knowledge. Specifically, the division may be based on the writing system, language similarity and the like of each language. For example, since the four languages Icelandic, Danish, Norwegian and Swedish all belong to the North Germanic (i.e., Scandinavian) branch of the Germanic language family, these four languages can be grouped into one language family; and since the four languages Portuguese, Spanish, French and Italian all belong to the Western Romance branch of the Romance language family, these four languages can be grouped into one language family. Other cases may be deduced by analogy and are not illustrated here one by one.
In an implementation scenario, in order to facilitate subsequent model training and speech recognition after training convergence, after the plurality of language families are divided, the sample subwords belonging to the same language family may further constitute the preset dictionary of that language family. For example, if a language family includes the 4 languages Icelandic, Danish, Norwegian and Swedish, then using the language-tagged subword sequences of each of the 4 languages, a dictionary of each language can be obtained by counting subword frequencies, denoted {Icelandic}, {Danish}, {Norwegian} and {Swedish}. The unified modeling dictionary of the language family is then the union of the 4 dictionaries, {Icelandic} ∪ {Danish} ∪ {Norwegian} ∪ {Swedish}. Other language families may be treated by analogy and are not illustrated here one by one. For convenience of description, the preset dictionary of the first language family may be denoted as Set1, that of the second language family as Set2, …, and that of the N-th language family as SetN, and so on.
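A minimal sketch of constructing a language family's unified modeling dictionary as the union of its per-language subword dictionaries follows; the toy dictionaries are assumptions.

```python
def family_dictionary(per_language_dicts):
    """Union of per-language subword dictionaries, e.g. {Icelandic} U {Danish} U ..."""
    unified = set()
    for d in per_language_dicts:
        unified |= d
    return unified

# Toy per-language dictionaries, in practice built by counting subword frequencies:
icelandic_dict = {"h_is", "u_is"}
danish_dict = {"h_da", "u_da"}
set1 = family_dictionary([icelandic_dict, danish_dict])   # SetN for the N-th family
```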
In the embodiment of the present disclosure, the speech recognition model of each language family is obtained by training on the sample speech subset of that language family, and the sample speech subset of each language family is obtained by dividing the sample speech set according to the plurality of language families obtained by classification. That is, after the language family to which each language belongs is determined, the sample speeches labeled with languages of the same language family can be divided into the same sample speech subset, and the speech recognition model of each language family can be trained on its sample speech subset.
In an implementation scenario, for each language family, the acoustic feature sequence of each sample speech may be obtained from the sample acoustic features of its speech frames in the sample speech subset of the language family; encoding is then performed based on the acoustic feature sequence to obtain a coding feature sequence, feature quantization is performed based on the coding feature sequence to obtain a quantization feature sequence, and context characterization is performed based on the coding feature sequence to obtain a context feature sequence. On this basis, decoding can be performed based on the context feature sequence to obtain a predicted subword sequence, and the network parameters of the language family's speech recognition model can then be adjusted based on the contrastive loss between the quantization feature sequence and the context feature sequence, together with the prediction loss of the predicted subword sequence compared with the sample subword sequence. In this manner, during training the speech recognition model receives supervised training through the prediction loss of the predicted subword sequence against the sample subword sequence, and unsupervised training through the contrastive loss between the quantization feature sequence and the context feature sequence. Combining the supervised and unsupervised training modes allows the speech representations extracted by the model to transfer better to the downstream task, which improves the generalization of the model to data from different scenarios and helps raise the recognition rate of low-resource and high-resource languages at the same time.
In a specific implementation scenario, to further improve the model performance of the speech recognition model, before training the speech recognition model of each language family, a speech feature extraction network may be obtained by pre-training (specifically, unsupervised pre-training) on speech data of several languages; the speech feature extraction network and a decoding network together form the speech recognition model, with the speech feature extraction network performing the encoding and context characterization and the decoding network performing the decoding. Illustratively, the speech feature extraction network may include, but is not limited to, XLSR. It should be noted that XLSR is a network model built on the wav2vec2 framework and trained with a contrastive loss, using about 500,000 hours of speech data in 53 languages during pre-training. For the specific extraction process, reference may be made to the technical details of network models such as XLSR, which are not repeated here. Further, the decoding network may include, but is not limited to, a Transformer Decoder; for its specific decoding process, reference may be made to the technical details of such network models, which are not repeated here. In this manner, before the speech recognition models of the language families are trained, the speech feature extraction network is pre-trained on speech data of several languages. After unsupervised pre-training on a large amount of speech data, the model can summarize certain latent prior distributions of speech data, so that cross-language knowledge transfer can be realized, which helps improve the recognition performance of the speech recognition model on low-resource languages and reduces the training difficulty of the speech recognition model as much as possible.
In a specific implementation scenario, the sample acoustic features of each speech frame of each sample speech may be extracted in advance, so as to obtain the acoustic feature sequence of the sample speech. It should be noted that the sample acoustic features may include, but are not limited to: PLP (Perceptual Linear Prediction), MFCC (Mel-Frequency Cepstral Coefficients), Filter-Bank, etc. For convenience of description, the acoustic feature sequence of a sample speech may be represented as X = {x_1, …, x_j, …, x_J}, where x_j denotes the sample acoustic feature of the j-th speech frame and J is the total number of speech frames of the sample speech. Illustratively, the sample acoustic features may be 40-dimensional Filter-Bank features.
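For illustration, one common way to extract such 40-dimensional Filter-Bank features is torchaudio's Kaldi-compatible implementation; the tool choice and file name are assumptions, not prescribed by the patent.

```python
import torchaudio

waveform, sample_rate = torchaudio.load("sample.wav")            # hypothetical sample speech
X = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=40, sample_frequency=sample_rate)     # shape (J, 40): x_1 .. x_J
```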
In one specific implementation scenario, please refer to fig. 2, which is a schematic diagram of a framework of an embodiment of the speech recognition model. As shown in fig. 2, the speech feature extraction network may include an encoder and a contrast network. The encoder may include, but is not limited to, a convolutional neural network (its network structure is not limited here) and is configured to encode the sample acoustic feature of each speech frame in the acoustic feature sequence to obtain the sample coding feature of that frame, so that the combination of the sample coding features of the speech frames can be taken as the coding feature sequence. The contrast network may include, but is not limited to, a Transformer Encoder (its network structure is likewise not limited here) and is configured to perform context characterization on each sample coding feature in the coding feature sequence to obtain the sample context feature of each speech frame, so that the combination of the sample context features of the speech frames can be taken as the context feature sequence. For convenience of description, the sample acoustic feature of the i-th speech frame may be denoted as X_i, its sample coding feature as Z_i, its sample context feature as C_i, and its sample quantization feature as Q_i.
In a specific implementation scenario, in order to further improve the network performance of the speech feature extraction network, during feature quantization the sample coding features of the speech frames in the coding feature sequence may be quantized based on a pre-trained codebook to obtain the quantization feature sequence, which contains the sample quantization features of the speech frames; as shown in fig. 2, no speech frame is masked during feature quantization. In contrast, before the context characterization, the sample coding features of at least one speech frame in the coding feature sequence may be masked, e.g., by random masking, which is not limited here. On this basis, context characterization can be performed on the sample coding features of the masked coding feature sequence to obtain the context feature sequence, which contains the sample context features of the speech frames. In this manner, because the contrastive loss between the quantization feature sequence and the context feature sequence is incorporated in training, masking the sample coding features of at least one speech frame before context characterization forces the encoding to extract coding features as accurately as possible and forces the context characterization to be performed as accurately as possible, so that accurate speech features can be extracted even when speech frames are masked.
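The following schematic forward pass, in the style of wav2vec2, mirrors this description: quantization is applied to the unmasked coding features, while context characterization sees the masked ones. The module internals, masking probability and masked value are placeholders, not the patent's exact architecture.

```python
import torch

def extract_features(encoder, quantizer, context_net, acoustic_feats, mask_prob=0.15):
    Z = encoder(acoustic_feats)                 # sample coding feature per speech frame
    Q = quantizer(Z)                            # quantization features: no frame is masked
    mask = torch.rand(Z.shape[:2]) < mask_prob  # randomly mask at least one speech frame
    Z_masked = Z.clone()
    Z_masked[mask] = 0.0                        # (a learned mask embedding in practice)
    C = context_net(Z_masked)                   # sample context feature per speech frame
    return Q, C, mask                           # the contrastive loss compares Q and C
```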
In a specific implementation scenario, the specific calculation of the contrastive loss may refer to the technical details of such loss functions, which are not repeated here. In addition, decoding is performed several times based on the context feature sequence, so that the predicted subword corresponding to each decoding step can be obtained, and the predicted subwords obtained over all decoding steps are combined into the predicted subword sequence. It should be noted that at each decoding step, the prediction probability value of each preset subword in the preset dictionary of the language family to which the speech recognition model belongs may be obtained, and the preset subword with the maximum prediction probability value can be taken as the predicted subword of the current step. The prediction loss may be calculated based on the prediction probability values; for the specific calculation, reference may be made to the technical details of loss functions such as cross entropy, which are not repeated here. Further, in order to prevent, as much as possible, modeling units of multiple languages from appearing due to poor speech quality and to eliminate language crosstalk, a non-target-language probability masking strategy can be adopted in the decoding stage, which is described in the related passages below and not repeated here.
In a specific implementation scenario, after the contrastive loss and the prediction loss are obtained, they may be weighted by a first weight and a second weight, respectively, to obtain the sub-loss of a sample speech in the language family's sample speech subset on the speech recognition model, where the first weight is not greater than the second weight. For example, the sum of the first and second weights may be 1, with the second weight set to 0.9 and the first weight set to 0.1, which is not limited here. On this basis, the network parameters of the language family's speech recognition model can be adjusted based on the sub-losses of at least one sample speech in the language family's sample speech subset. For example, the speech recognition model of each language family may be trained over several rounds, and each round may draw a sample speech batch from the language family's sample speech subset; e.g., the acoustic feature sequences of N (e.g., 5, 10, etc.) samples may be drawn from all acoustic feature sequences of the language family to form the batch of that round. During each round, the sub-loss of each sample speech in the batch can be calculated as above, the training loss of the round can be obtained by combining (e.g., weighting) the sub-losses of the sample speeches in the batch, and the network parameters of the language family's speech recognition model can then be adjusted based on the training loss; the weighting of the sub-losses is described further below, and the specific parameter adjustment may follow an optimization method such as gradient descent, which is not repeated here. In this manner, weighting and summing the contrastive loss and the prediction loss by the first and second weights (the first not greater than the second, e.g., summing to 1) to obtain the sub-loss, and then adjusting the network parameters based on the sub-losses of at least one sample speech, ensures decoding accuracy as much as possible while realizing cross-language knowledge transfer based on the contrastive loss.
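A minimal sketch of the weighted sub-loss and the per-round training loss, using the example weights 0.1 and 0.9 given above; the uniform default batch weighting is an assumption (the dynamic weighting is described next).

```python
def sample_sub_loss(contrastive_loss, prediction_loss, w1=0.1, w2=0.9):
    assert w1 <= w2 and abs(w1 + w2 - 1.0) < 1e-8   # first weight <= second, summing to 1
    return w1 * contrastive_loss + w2 * prediction_loss

def round_training_loss(sub_losses, loss_weights=None):
    if loss_weights is None:                        # uniform unless a weight prediction
        loss_weights = [1.0 / len(sub_losses)] * len(sub_losses)  # model supplies weights
    return sum(w * l for w, l in zip(loss_weights, sub_losses))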
In one implementation scenario, as described above, for each language family the speech recognition model may be trained with a sample speech batch in each round, the batch being drawn from the language family's sample speech subset. In order to further improve the model performance, in each round of training, the loss weight of each sample speech in the batch may first be predicted by a weight prediction model; the sub-losses of the sample speeches in the batch are then weighted by these loss weights to obtain the training loss of the round, and the network parameters of the speech recognition model are adjusted based on this training loss. Afterwards, the validation loss of the parameter-adjusted speech recognition model on a validation set can be obtained, and the network parameters of the weight prediction model can be adjusted based on the distribution difference between the training loss and the validation loss. The next round proceeds by analogy, and the process repeats, so it is not described again; for the calculation of each sample speech's sub-loss, refer to the foregoing description. In this manner, in each round the speech recognition model is optimized through the training loss while the weight prediction model is optimized through the distribution difference, so that a dynamic balance can be formed between low-resource and high-resource languages as far as possible, helping the speech recognition model improve its recognition performance on both.
In a specific implementation scenario, a portion of the data may be randomly split from the training data to serve as the validation set. It should be noted that the validation set does not participate in the training of the speech recognition model; its function is mainly to verify model performance, and its distribution is used to approximate the distribution of the test set.
In a specific implementation scenario, please refer to fig. 3, which is a schematic diagram of a framework of an embodiment of the dynamic language balancing strategy. As shown in fig. 3, θ_w^t denotes the network parameters of the weight prediction model during the t-th iteration of training, θ_t denotes the network parameters of the speech recognition model before parameter adjustment in the t-th iteration, B^t denotes the sample speech batch adopted in the t-th iteration, ∇_{θ_t} J_train(θ_t, B^t) denotes the gradient distribution of the training loss with respect to the network parameters θ_t in the t-th iteration, θ'_{t+1} denotes the network parameters of the speech recognition model after parameter adjustment in the t-th iteration, D_dev denotes the validation set, and ∇_{θ'_{t+1}} J_dev(θ'_{t+1}, D_dev) denotes the gradient distribution of the validation loss J_dev(θ'_{t+1}, D_dev) with respect to θ'_{t+1}. On this basis, the distribution difference between the two gradient distributions can be obtained with a loss function such as the KL divergence (Kullback-Leibler divergence); for the specific calculation, reference may be made to the technical details of such loss functions, which are not repeated here. In the two-dimensional graph shown in fig. 3, the abscissa represents the language and the ordinate represents the amount of data. As shown in fig. 3, after balancing by the loss weights predicted by the weight prediction model, a dynamic balance can be formed between low-resource and high-resource languages as far as possible while training the speech recognition model. In addition, in order to further improve the prediction performance of the weight prediction model, the measured distribution difference may be weighted with reference to the weighting of the training loss; the weight may be set empirically, e.g., to 0.01, which is not limited here.
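The sketch below is one runnable, simplified interpretation of a balancing round: the weight prediction model produces per-sample loss weights, the speech recognition model is updated with the weighted training loss, and the weight prediction model is then updated from a KL-divergence distribution difference weighted by 0.01. The patent's exact gradient-distribution formulation is approximated here by comparing the weighted training-loss distribution with the validation-loss distribution, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def balancing_round(weight_model, asr_opt, weight_opt,
                    batch_embed, sub_losses, dev_losses):
    # sub_losses: per-sample losses on the sample speech batch (with autograd graph)
    # dev_losses: per-sample losses of the adjusted model on the validation set D_dev
    weights = torch.softmax(weight_model(batch_embed).squeeze(-1), dim=0)
    train_loss = (weights * sub_losses).sum()       # J_train of this round

    asr_opt.zero_grad()
    train_loss.backward(retain_graph=True)          # keep the weight-model graph alive
    asr_opt.step()                                  # theta_t -> theta'_{t+1}

    p = F.log_softmax(weights * sub_losses.detach(), dim=0)  # training-loss distribution
    q = F.softmax(dev_losses.detach(), dim=0)                # validation-loss distribution
    meta_loss = 0.01 * F.kl_div(p, q, reduction="sum")       # 0.01 weighting, per the text
    weight_opt.zero_grad()
    meta_loss.backward()                            # update only the weight prediction model
    weight_opt.step()
    return train_loss.item()
```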
In an implementation scenario, different from the foregoing joint training of the speech recognition model combining the supervised prediction loss and the unsupervised contrastive loss, when the requirement on recognition performance is relaxed, training may also be performed based on the prediction loss alone in order to simplify the training process. Specifically, the acoustic feature sequence of each sample speech is obtained from the sample acoustic features of its speech frames in the language family's sample speech subset, encoding is performed based on the acoustic feature sequence to obtain a coding feature sequence, and context characterization is performed based on the coding feature sequence to obtain a context feature sequence; decoding is then performed based on the context feature sequence to obtain a predicted subword sequence, and the network parameters of the language family's speech recognition model are adjusted based on the prediction loss of the predicted subword sequence compared with the sample subword sequence. For the more specific training process, refer to the foregoing description, which is not repeated here. When training is based on the prediction loss alone, at least one of the dynamic language balancing strategy and the probability masking strategy may still be adopted, which is not limited here.
Step S12: recognize the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs, to obtain the recognition text of the speech to be recognized.
In an implementation scenario, after the speech recognition models of the language families are trained, the speech to be recognized may be recognized based on the speech recognition model corresponding to the language family of the target language, so as to obtain the recognition result of the current recognition step. The recognition result may include the prediction probability value of each preset subword in the preset dictionary, where each preset subword is labeled with the language to which it belongs. On this basis, the subword of the current recognition step can be obtained from the recognition result, and the recognition text can be assembled from the subwords obtained over all recognition steps.
In a specific implementation scenario, at each recognition step, the preset subword with the maximum prediction probability value in the recognition result may be taken as the subword of the current step, so that when the subword obtained at some step is the end character (e.g., </s>), the end of recognition can be confirmed. On this basis, the language labels can be removed from the subwords obtained over the recognition steps, and the unlabeled subwords can be combined to obtain the recognition text corresponding to the speech to be recognized.
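Illustratively, the recognition loop can be sketched as follows; `decode_step` and the dictionary layout are hypothetical, and a character-level dictionary is assumed.

```python
def recognize(decode_step, preset_dictionary, max_steps=200):
    subwords = []
    for _ in range(max_steps):
        probs = decode_step(subwords)                  # prediction probability per preset subword
        best = preset_dictionary[int(probs.argmax())]  # subword with the maximum probability
        if best == "</s>":                             # end character: recognition finished
            break
        subwords.append(best)
    # strip the language label, e.g. "h_en" -> "h", and merge into the recognition text
    return "".join(w.rsplit("_", 1)[0] for w in subwords if w != "<s>")
```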
In a specific implementation scenario, the "predetermined dictionary" referred to by the recognition result may be a predetermined dictionary corresponding to a language family to which the target language belongs. The specific meaning of the preset dictionary can refer to the related description, and is not described herein again.
In an implementation scenario, different from directly taking the preset subword with the maximum prediction probability value as the subword of the current recognition step, and as described above, in order to prevent modeling units of multiple languages from appearing due to poor speech quality and to eliminate language crosstalk as much as possible, a non-target-language probability masking strategy may be adopted in the decoding stage. Specifically, the speech to be recognized may be recognized based on the speech recognition model corresponding to the language family of the target language to obtain the recognition result of the current step, which includes the prediction probability value of each preset subword in the preset dictionary, each preset subword being labeled with its language. On this basis, the preset subwords whose language differs from the target language can be taken as target subwords, and their prediction probability values in the recognition result can be suppressed, so that the subword of the current step can be obtained from the updated recognition result, and the recognition text can be assembled from the subwords obtained over the recognition steps. In this manner, taking the preset subwords whose language differs from the target language as target subwords and suppressing their prediction probability values prevents, as much as possible, modeling units of multiple languages from appearing due to poor speech quality, eliminates language crosstalk as much as possible, and thus improves the accuracy of speech recognition.
In a specific implementation scenario, the prediction probability values may be calculated by log-softmax on the output of the decoding network, and their size may be represented as (batch_size, vocab_size), where batch_size is the size of the sample speech batch and vocab_size is the size of the preset dictionary. On this basis, a masking matrix of the same size can be generated and multiplied element-wise with the prediction probability values of the preset subwords in the recognition result: the masking values corresponding to target subwords are set to positive infinity, while those not corresponding to target subwords are set to 1. That is, for a preset subword not belonging to the target language (i.e., a target subword), its prediction probability value is multiplied by positive infinity, which imposes a penalty and suppresses the prediction probability of subwords outside the currently decoded language (i.e., the target language); conversely, for a preset subword belonging to the target language, its prediction probability value is multiplied by 1 and thus kept unchanged.
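A sketch of this masking step, following the multiply-by-positive-infinity scheme described above (log-softmax scores are assumed strictly negative, so wrong-language entries are driven toward negative infinity); all names here are assumptions.

```python
import torch

def mask_recognition_result(log_probs, preset_dictionary, target_lang):
    # log_probs: (batch_size, vocab_size) log-softmax scores, assumed strictly negative
    mask = torch.ones(log_probs.shape[-1])
    for idx, subword in enumerate(preset_dictionary):
        if "_" in subword and not subword.endswith(f"_{target_lang}"):
            mask[idx] = float("inf")   # target subword (wrong language): score driven to -inf
    return log_probs * mask            # target-language entries multiplied by 1, unchanged
```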
In one specific implementation scenario, please refer to fig. 4, which is a block diagram of an embodiment of the mask decoding strategy. As shown in fig. 4, the languages of the 4th and 7th preset subwords are lang2 and lang3, respectively, while the target language is lang1; therefore, the 4th and 7th elements of the masking matrix can be set to positive infinity and the other elements to 1, and the recognition result is processed with this masking matrix to obtain the updated recognition result. On this basis, a Beam Search decoding strategy can be executed to obtain a recognition text free of crosstalk.
In an implementation scenario, please refer to fig. 5, which is a schematic diagram of a framework of an embodiment of the speech recognition method of the present application. As shown in fig. 5, in the training stage, semantic features are extracted from the sample subword sequences labeled for the sample speeches in the sample speech set, and feature clustering is performed on them, dividing the languages into a plurality of language families: language family 1, …, language family m, …, language family M. On this basis, the speech recognition models of the M language families can be trained separately. For example, for the m-th language family, its speech recognition model may be trained on the acoustic feature sequences and sample subword sequences of the sample speeches in its sample speech subset, using the joint supervised and unsupervised training mode. It should be noted that during training, the dynamic language balancing strategy may be adopted to balance low-resource and high-resource languages. After training converges, for a speech to be recognized whose language is the target language, the speech recognition model of the language family to which the target language belongs can be used to recognize it and obtain the recognition text. In addition, in order to prevent modeling units of multiple languages from appearing due to poor speech quality and to eliminate language crosstalk as much as possible, the mask decoding strategy may be adopted in the decoding stage of speech recognition.
According to the above scheme, the target language to which the speech to be recognized belongs is acquired, along with the speech recognition models of a plurality of language families. The plurality of language families are obtained by analyzing, based on any one of a plurality of classification modes, the sample subword sequence labeled for each sample speech in a sample speech set, and the plurality of classification modes at least include feature clustering of the sample subword sequences; the speech recognition model of each language family is trained on that language family's sample speech subset, which is obtained by dividing the sample speech set according to the language families obtained by classification. On this basis, the speech to be recognized is recognized by the speech recognition model corresponding to the language family to which the target language belongs, to obtain its recognition text. Because the language families are derived by analyzing the labeled sample subword sequences, with feature clustering among the classification modes, languages that can share a model are grouped into the same language family as far as possible. On the one hand, a speech recognition model need not be built independently for each language, which helps reduce the application cost of speech recognition models; on the other hand, because each language family's speech recognition model can learn the information common to the similar languages within the family, the adverse effect of low-resource languages in the family on training is weakened as far as possible, and recognition performance can be improved. Therefore, the application cost of the speech recognition model can be reduced while its recognition performance is improved.
Referring to fig. 6, fig. 6 is a schematic block diagram of a speech recognition apparatus 60 according to an embodiment of the present application. The speech recognition apparatus 60 includes: a language acquisition module 61, a model acquisition module 62 and a recognition module 63. The language acquisition module 61 is used to acquire the target language to which the speech to be recognized belongs; the model acquisition module 62 is used to acquire the speech recognition models of a plurality of language families, where the plurality of language families are obtained by analyzing, based on any one of a plurality of classification modes, the sample subword sequence labeled for each sample speech in a sample speech set, the plurality of classification modes at least including feature clustering of the sample subword sequences, the speech recognition model of each language family is trained on that language family's sample speech subset, and the sample speech subset of each language family is obtained by dividing the sample speech set according to the language families obtained by classification; and the recognition module 63 is used to recognize the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs, to obtain the recognition text of the speech to be recognized.
In the above scheme, the plurality of language families are obtained by analyzing the sample subword sequences labeled for the sample speeches in the sample speech set based on any one of a plurality of classification modes, and the classification modes at least include feature clustering of the sample subword sequences, so languages that can share a model are grouped into the same language family as far as possible. On the one hand, speech recognition models need not be built independently for different languages, which reduces their application cost; on the other hand, because each language family's speech recognition model can learn the information common to the similar languages within the family, the adverse effect of low-resource languages on training is weakened as far as possible, which improves recognition performance. Therefore, the application cost of the speech recognition model can be reduced while its recognition performance is improved.
In some disclosed embodiments, the model obtaining module 62 includes a feature extraction sub-module, configured to obtain an acoustic feature sequence of the sample speech based on sample acoustic features of each speech frame of the sample speech in the sample speech subset of the language family; the model obtaining module 62 includes a feature coding submodule, configured to perform coding based on the acoustic feature sequence to obtain a coding feature sequence; the model obtaining module 62 includes a feature quantization submodule, configured to perform feature quantization based on the encoded feature sequence to obtain a quantized feature sequence; the model obtaining module 62 includes a context characterization sub-module, configured to perform context characterization based on the coding feature sequence to obtain a context feature sequence; the model obtaining module 62 includes a decoding submodule, configured to perform decoding based on the context feature sequence to obtain a predicted subword sequence; the model obtaining module 62 includes a parameter adjusting sub-module, configured to adjust network parameters of the speech recognition model of the language family based on a comparison loss between the quantized feature sequence and the context feature sequence and a prediction loss of the predicted sub-word sequence compared to the sample sub-word sequence.
Therefore, during training, the speech recognition model is trained in a supervised manner through the prediction loss of the predicted sub-word sequence relative to the sample sub-word sequence, and in an unsupervised manner through the contrastive loss between the quantized feature sequence and the context feature sequence. Combining the supervised and unsupervised objectives allows the speech representations extracted by the model to transfer better to the downstream recognition task, which improves generalization across different scenarios and benefits the recognition rate of low-resource and high-resource languages alike.
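As a minimal sketch only, the two objectives might be combined as follows; the model interface (extract_features, encode, quantize, contextualize, decode) and the specific contrastive formulation are assumptions for illustration:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, quantized, temperature=0.1):
        # context, quantized: (batch, time, dim). For each frame, the quantized
        # feature at the same time step is the positive; other frames are negatives.
        c = F.normalize(context, dim=-1)
        q = F.normalize(quantized, dim=-1)
        sim = torch.einsum("btd,bsd->bts", c, q) / temperature        # (batch, T, T)
        targets = torch.arange(sim.size(1), device=sim.device).expand(sim.size(0), -1)
        return F.cross_entropy(sim.transpose(1, 2), targets)

    def training_losses(model, speech_frames, sample_subword_ids):
        """One hypothetical forward pass yielding both training losses."""
        acoustic = model.extract_features(speech_frames)   # acoustic feature sequence
        encoded = model.encode(acoustic)                   # coding feature sequence
        quantized = model.quantize(encoded)                # quantized feature sequence
        context = model.contextualize(encoded)             # context feature sequence
        logits = model.decode(context)                     # (batch, time, vocab)
        prediction = F.cross_entropy(logits.transpose(1, 2), sample_subword_ids)
        return contrastive_loss(context, quantized), prediction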
In some disclosed embodiments, before the speech recognition model of each language family is trained, a speech feature extraction network is obtained by pre-training on speech data of several languages. The speech feature extraction network and a decoding network together form the speech recognition model, with the speech feature extraction network performing the encoding and context characterization, and the decoding network performing the decoding.
Therefore, because the speech feature extraction network is pre-trained on speech data of several languages before the speech recognition models of the language families are trained, the model performance of each speech recognition model can be improved while the training difficulty is reduced as much as possible.
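The two-stage construction might be sketched as follows; SpeechRecognitionModel and the linear decoding network are illustrative assumptions:

    import torch

    class SpeechRecognitionModel(torch.nn.Module):
        """Pre-trained feature extraction network plus a decoding network."""
        def __init__(self, feature_net, decoder):
            super().__init__()
            self.feature_net = feature_net   # performs encoding + context characterization
            self.decoder = decoder           # performs decoding into sub-words

        def forward(self, speech_frames):
            return self.decoder(self.feature_net(speech_frames))

    def build_family_model(feature_net, family_vocab_size, feature_dim=512):
        # feature_net is assumed pre-trained on multilingual speech; only the
        # decoder starts from scratch for the language family's preset dictionary.
        decoder = torch.nn.Linear(feature_dim, family_vocab_size)
        return SpeechRecognitionModel(feature_net, decoder)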
In some disclosed embodiments, the feature quantization sub-module is specifically configured to perform feature quantization on the sample coding features of each speech frame in the coding feature sequence based on a pre-trained codebook to obtain the quantized feature sequence, where the quantized feature sequence includes the sample quantized feature of each speech frame. The model obtaining module 62 includes a speech frame masking sub-module, configured to mask the sample coding features of at least one speech frame in the coding feature sequence. The context characterization sub-module is specifically configured to perform context characterization based on the sample coding features of the speech frames in the masked coding feature sequence to obtain the context feature sequence, where the context feature sequence includes the sample context feature of each speech frame.
Therefore, the sample coding features of at least one speech frame are masked before context characterization. Because the training process also includes the contrastive loss between the quantized feature sequence and the context feature sequence, the encoding step is forced to extract coding features as accurately as possible, and the context characterization step is forced to be performed as accurately as possible, so that accurate speech features can be extracted even when speech frames are masked.
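A minimal sketch of codebook quantization plus frame masking, assuming nearest-neighbor lookup in the pre-trained codebook (the application does not fix the quantization scheme):

    import torch

    def quantize_and_mask(encoded, codebook, mask_prob=0.15):
        """encoded: (batch, time, dim) coding features; codebook: (num_codes, dim)."""
        # Feature quantization: snap each frame to its nearest pre-trained codebook entry.
        dists = torch.cdist(encoded, codebook.unsqueeze(0).expand(encoded.size(0), -1, -1))
        quantized = codebook[dists.argmin(dim=-1)]           # quantized feature sequence
        # Mask the sample coding features of some frames before context characterization.
        mask = torch.rand(encoded.shape[:2], device=encoded.device) < mask_prob
        masked = encoded.clone()
        masked[mask] = 0.0                                   # or a learned mask embedding
        return quantized, masked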
In some disclosed embodiments, the parameter adjusting sub-module includes a weighting unit, configured to weight the contrastive loss and the prediction loss based on a first weight and a second weight, respectively, to obtain the sub-loss of a sample speech in the sample speech subset of the language family on the speech recognition model, where the first weight is not greater than the second weight. The parameter adjusting sub-module includes an adjusting unit, configured to adjust the network parameters of the speech recognition model of the language family based on the sub-losses respectively corresponding to at least one sample speech in the sample speech subset of the language family.
Therefore, the contrastive loss and the prediction loss are weighted by the first weight and the second weight, respectively, to obtain the sub-loss of each sample speech, and the network parameters of the speech recognition model of the language family are adjusted based on the sub-losses of at least one sample speech in the sample speech subset. Because the first weight is not greater than the second weight, cross-language knowledge transfer driven by the contrastive loss is realized while decoding accuracy, driven by the dominant prediction loss, is preserved as much as possible.
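Sketched below with illustrative weight values; the application fixes only the constraint that the first weight not exceed the second:

    def sample_sub_loss(contrastive, prediction, w1=0.5, w2=1.0):
        # w1 weights the contrastive loss, w2 the prediction loss; w1 <= w2
        # keeps the supervised decoding objective dominant.
        assert w1 <= w2
        return w1 * contrastive + w2 * prediction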
In some disclosed embodiments, the speech recognition model of the language family is trained in each round based on a sample speech batch selected from the sample speech subset of the language family. The speech recognition device 60 further includes a weight prediction module, configured to predict, in each round of training, a loss weight for each sample speech in the sample speech batch based on a weight prediction model. The speech recognition device 60 further includes a loss weighting module, configured to weight the sub-loss of each sample speech in the batch by its loss weight to obtain the training loss of the current round, where the network parameters of the speech recognition model are adjusted based on the training loss. The speech recognition device 60 further includes a model verification module, configured to obtain the verification loss of the parameter-adjusted speech recognition model on a verification set, and a model optimization module, configured to adjust the network parameters of the weight prediction model based on the distribution difference between the training loss and the verification loss.
Therefore, in each round of training, a two-level optimization is performed: the speech recognition model is adjusted through the training loss, and the weight prediction model is adjusted through the distribution difference between the training loss and the verification loss. This keeps low-resource and high-resource languages in dynamic balance as much as possible, which helps the speech recognition model improve recognition performance on both.
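A minimal first-order sketch of such two-level optimization on a toy linear model is given below. The concrete update rule (a differentiable virtual step, in the style of sample re-weighting meta-learning) and all names are assumptions for illustration, not the procedure specified in the application:

    import torch

    def bilevel_round(theta, weight_net, train_feats, train_targets,
                      val_feats, val_targets, lr_model=0.1, lr_w=0.01):
        """theta: (dim,) model parameters with requires_grad=True;
        weight_net: maps a per-sample loss to a loss weight."""
        # Level 1: weight each sample's sub-loss and take a virtual model step.
        per_sample = (train_feats @ theta - train_targets) ** 2          # (batch,)
        weights = torch.sigmoid(weight_net(per_sample.detach().unsqueeze(1))).squeeze(1)
        train_loss = (weights * per_sample).mean()
        grad_theta = torch.autograd.grad(train_loss, theta, create_graph=True)[0]
        theta_new = theta - lr_model * grad_theta                        # differentiable step
        # Level 2: the verification loss of the updated model, whose dependence
        # on the weights survives the virtual step, adjusts the weight prediction model.
        val_loss = ((val_feats @ theta_new - val_targets) ** 2).mean()
        val_loss.backward()
        with torch.no_grad():
            for p in weight_net.parameters():
                p -= lr_w * p.grad
                p.grad = None
        return theta_new.detach().requires_grad_(True)

Here weight_net could be as simple as torch.nn.Linear(1, 1).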
In some disclosed embodiments, the recognition module 63 includes a recognition sub-module, configured to recognize the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs, and obtain the recognition result of the current recognition, where the recognition result includes the predicted probability value of each preset sub-word in a preset dictionary, and each preset sub-word is marked with the language to which it belongs. The recognition module 63 includes a selection sub-module, configured to take the preset sub-words whose language differs from the target language as target sub-words, and a suppression sub-module, configured to suppress the predicted probability values of the target sub-words in the recognition result. The recognition module 63 further includes a determining sub-module, configured to obtain the sub-word of the current recognition based on the latest recognition result, and a combination sub-module, configured to obtain the recognition text based on the sub-words obtained in all past recognitions.
Therefore, by taking the preset sub-words whose language differs from the target language as target sub-words and suppressing their predicted probability values in the recognition result, decoding into modeling units of other languages due to poor speech quality can be avoided as much as possible, which eliminates language crosstalk and improves the accuracy of speech recognition.
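Suppression might be realized by masking logits, as in this sketch; forcing suppressed sub-words to zero probability is one possible choice, since the application only requires that their predicted probability values be suppressed:

    import torch

    def suppress_target_subwords(logits, subword_languages, target_language):
        """logits: (vocab,) scores over the preset dictionary; subword_languages:
        the language each preset sub-word is marked with (hypothetical layout)."""
        suppressed = logits.clone()
        for i, lang in enumerate(subword_languages):
            if lang != target_language:        # these are the target sub-words
                suppressed[i] = float("-inf")  # probability becomes zero after softmax
        return suppressed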
In some disclosed embodiments, the sample sub-word sequence is obtained by sub-word segmentation of the sample text corresponding to the sample speech, each sample sub-word in the sequence is marked with the language to which it belongs, and the sample sub-words whose languages belong to the same language family together form the preset dictionary of that language family.
Therefore, by constructing a preset dictionary for each language family, information sharing among the similar languages within the same language family can be realized.
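A minimal sketch of assembling the per-family preset dictionaries from language-labeled sample sub-words; the data layout is an assumption:

    def build_preset_dictionaries(labeled_subwords, language_to_family):
        """labeled_subwords: iterable of (subword, language) pairs, one per sample
        sub-word; language_to_family: maps each language to its language family."""
        dictionaries = {}
        for subword, language in labeled_subwords:
            family = language_to_family[language]
            # A preset sub-word keeps its language label inside the family dictionary.
            dictionaries.setdefault(family, set()).add((subword, language))
        return dictionaries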
In some disclosed embodiments, the speech recognition apparatus 60 further includes a semantic extraction module, configured to obtain the semantic features of each sample sub-word in the sample sub-word sequence labeled for each sample speech in the sample speech set; a feature clustering module, configured to perform feature clustering based on the semantic features of the sample sub-words to obtain a plurality of feature sets; and a language family determining module, configured to determine, for each feature set, a language family based on the languages to which the sample sub-words whose semantic features fall in that feature set respectively belong.
Therefore, by clustering the semantic features of the sample sub-words, the language families are divided at the semantic level, which ensures that the different languages within each language family have similar attributes, so that in subsequent training the speech recognition model can learn the information shared among the similar languages of the same language family.
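One way to realize such clustering is sketched below with k-means over sub-word semantic features; the choice of k-means and the majority-based family assignment are assumptions, since the application does not fix the clustering algorithm:

    from collections import Counter
    from sklearn.cluster import KMeans

    def derive_language_families(subword_embeddings, subword_languages, n_families=4):
        """subword_embeddings: (num_subwords, dim) semantic features;
        subword_languages: the language each sample sub-word is marked with."""
        labels = KMeans(n_clusters=n_families, n_init=10).fit_predict(subword_embeddings)
        # Each feature set (cluster) involves the languages of its sub-words.
        cluster_langs = {c: Counter() for c in range(n_families)}
        for lang, c in zip(subword_languages, labels):
            cluster_langs[c][lang] += 1
        # Assign each language to the family whose feature set holds most of its sub-words.
        families = {}
        for lang in set(subword_languages):
            families[lang] = max(cluster_langs, key=lambda c: cluster_langs[c][lang])
        return families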
Referring to fig. 7, fig. 7 is a schematic block diagram of an embodiment of an electronic device 70 according to the present application. The electronic device 70 includes a memory 71 and a processor 72 coupled to each other; the memory 71 stores program instructions, and the processor 72 is configured to execute the program instructions to implement the steps in any of the speech recognition method embodiments described above. Specifically, the electronic device 70 may include, but is not limited to, a desktop computer, a notebook computer, a server, a mobile phone, a tablet computer, and the like, which is not limited herein.
In particular, the processor 72 is configured to control itself and the memory 71 to implement the steps in any of the speech recognition method embodiments described above. The processor 72 may also be referred to as a CPU (Central Processing Unit). The processor 72 may be an integrated circuit chip having signal processing capabilities. The processor 72 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or any conventional processor. In addition, the processor 72 may be implemented jointly by a plurality of integrated circuit chips.
According to the above scheme, the plurality of language families are obtained by analyzing the sample sub-word sequences labeled for the sample speeches based on any one of a plurality of classification modes that at least include feature clustering, so languages that can be modeled jointly are classified into the same language family as much as possible. On one hand, speech recognition models do not need to be constructed independently for each language, which reduces application cost; on the other hand, the speech recognition model of each language family can learn the information shared among similar languages within that family, which weakens the adverse influence of low-resource languages on training and improves recognition performance. Therefore, the application cost of the speech recognition models can be reduced, and their recognition performance can be improved.
Referring to fig. 8, fig. 8 is a schematic block diagram of an embodiment of a computer-readable storage medium 80 according to the present application. The computer-readable storage medium 80 stores program instructions 81 that can be executed by a processor, the program instructions 81 being used to implement the steps in any of the speech recognition method embodiments described above.
According to the above scheme, the plurality of language families are obtained by analyzing the sample sub-word sequences labeled for the sample speeches based on any one of a plurality of classification modes that at least include feature clustering, so languages that can be modeled jointly are classified into the same language family as much as possible. On one hand, speech recognition models do not need to be constructed independently for each language, which reduces application cost; on the other hand, the speech recognition model of each language family can learn the information shared among similar languages within that family, which weakens the adverse influence of low-resource languages on training and improves recognition performance. Therefore, the application cost of the speech recognition models can be reduced, and their recognition performance can be improved.
In some embodiments, the functions or modules of the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementation, reference may be made to the description of those method embodiments, which is not repeated here for brevity.
The foregoing description of the various embodiments is intended to highlight the differences between the embodiments; for the same or similar parts, reference may be made between embodiments, which are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
If the technical solution of the present application involves personal information, a product applying the technical solution clearly informs the user of the personal information processing rules and obtains the individual's separate consent before processing the personal information. If the technical solution involves sensitive personal information, the product obtains the individual's separate consent before processing and additionally satisfies the requirement of "explicit consent". For example, at a personal information collection device such as a camera, a clear and prominent notice informs the individual that he or she is entering a personal information collection range and that personal information will be collected; if the individual voluntarily enters the collection range, this is regarded as consent to the collection. Alternatively, on a device that processes personal information, with the personal information processing rules communicated via a prominent notice, personal authorization is obtained through a pop-up message or by asking the individual to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information to be processed.

Claims (12)

1. A speech recognition method, comprising:
acquiring a target language to which a speech to be recognized belongs, and acquiring respective speech recognition models of a plurality of language families; wherein the plurality of language families are obtained by analyzing a sample sub-word sequence labeled for each sample speech in a sample speech set based on any one of a plurality of classification modes, the plurality of classification modes at least comprise feature clustering of the sample sub-word sequences, the speech recognition model of each language family is obtained by training on the sample speech sub-set of that language family, and the sample speech sub-set of each language family is obtained by dividing the sample speech set based on the plurality of language families obtained by the classification;
and recognizing the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs, so as to obtain a recognition text of the speech to be recognized.
2. The method according to claim 1, wherein the acquiring respective speech recognition models of a plurality of language families comprises:
obtaining an acoustic feature sequence of the sample voice based on the sample acoustic features of each voice frame of the sample voice in the sample voice subset of the language family;
encoding based on the acoustic feature sequence to obtain a coding feature sequence, performing feature quantization based on the coding feature sequence to obtain a quantized feature sequence, and performing context characterization based on the coding feature sequence to obtain a context feature sequence;
decoding based on the context feature sequence to obtain a predicted sub-word sequence;
and adjusting network parameters of the speech recognition model of the language family based on the contrastive loss between the quantized feature sequence and the context feature sequence and the prediction loss of the predicted sub-word sequence compared with the sample sub-word sequence.
3. The method according to claim 2, wherein before the speech recognition model of each of the language families is trained, a speech feature extraction network is obtained by pre-training on speech data of several languages, the speech feature extraction network and a decoding network form the speech recognition model, the speech feature extraction network is used for performing the encoding and the context characterization, and the decoding network is used for performing the decoding.
4. The method of claim 2, wherein the performing feature quantization based on the coding feature sequence to obtain a quantized feature sequence comprises:
performing feature quantization on the sample coding features of each speech frame in the coding feature sequence based on a pre-trained codebook to obtain the quantized feature sequence; wherein the quantized feature sequence comprises a sample quantized feature of each speech frame;
before performing context characterization based on the coding feature sequence to obtain a context feature sequence, the method further includes:
masking the sample coding characteristics of at least one speech frame in the coding characteristic sequence;
performing context characterization based on the coding feature sequence to obtain a context feature sequence, including:
performing context characterization based on the sample coding features of the speech frames in the masked coding feature sequence to obtain a context feature sequence; wherein the context feature sequence comprises a sample context feature of each of the speech frames.
5. The method according to claim 2, wherein the adjusting network parameters of the speech recognition model of the language family based on the contrastive loss between the quantized feature sequence and the context feature sequence and the prediction loss of the predicted sub-word sequence compared with the sample sub-word sequence comprises:
weighting the contrastive loss and the prediction loss respectively based on a first weight and a second weight to obtain a sub-loss of the sample speech in the sample speech subset of the language family on the speech recognition model; wherein the first weight is not greater than the second weight;
and adjusting the network parameters of the speech recognition model of the language family based on the respective corresponding sub-losses of at least one sample speech in the sample speech subset of the language family.
6. The method of claim 1, wherein the speech recognition model of the language family is trained in each round of training based on a sample speech batch selected from the sample speech subset of the language family, and wherein the method further comprises, in each round of training the speech recognition model of the language family:
predicting a loss weight for each sample speech in the sample speech batch based on a weight prediction model;
weighting the sub-loss of each sample speech in the sample speech batch based on its loss weight to obtain the training loss of the current round of training the speech recognition model; wherein the speech recognition model is trained in the current round with its network parameters adjusted based on the training loss;
obtaining the verification loss of the parameter-adjusted speech recognition model on a verification set;
adjusting network parameters of the weight prediction model based on the distribution difference between the training loss and the verification loss.
7. The method according to claim 1, wherein the recognizing the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs, so as to obtain a recognition text of the speech to be recognized, comprises:
recognizing the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs to obtain a recognition result of the current recognition; wherein the recognition result comprises: a predicted probability value of each preset sub-word in a preset dictionary, wherein each preset sub-word is marked with the language to which it belongs;
taking the preset sub-words whose language is different from the target language as target sub-words, and suppressing the predicted probability values of the target sub-words in the recognition result;
and obtaining the sub-word of the current recognition based on the latest recognition result, and obtaining the recognition text based on the sub-words obtained in past recognitions.
8. The method according to claim 7, wherein the sample sub-word sequence is obtained by sub-word segmentation of the sample text corresponding to the sample speech, each sample sub-word in the sample sub-word sequence is marked with the language to which it belongs, and the sample sub-words whose languages belong to the same language family constitute the preset dictionary of that language family.
9. The method according to claim 1, wherein before the acquiring respective speech recognition models of a plurality of language families, the method further comprises:
obtaining semantic features of each sample sub-word in the sample sub-word sequence labeled for each sample speech in the sample speech set;
performing feature clustering based on semantic features of the sample sub-words to obtain a plurality of feature sets;
and for each feature set, determining a language family based on the languages to which the sample sub-words whose semantic features fall in the feature set respectively belong.
10. A speech recognition apparatus, comprising:
the language acquisition module is used for acquiring a target language to which a speech to be recognized belongs;
the model acquisition module is used for acquiring respective speech recognition models of a plurality of language families; wherein the plurality of language families are obtained by analyzing a sample sub-word sequence labeled for each sample speech in a sample speech set based on any one of a plurality of classification modes, the plurality of classification modes at least comprise feature clustering of the sample sub-word sequences, the speech recognition model of each language family is obtained by training on the sample speech sub-set of that language family, and the sample speech sub-set of each language family is obtained by dividing the sample speech set based on the plurality of language families obtained by the classification;
and the recognition module is used for recognizing the speech to be recognized based on the speech recognition model corresponding to the language family to which the target language belongs, so as to obtain a recognition text of the speech to be recognized.
11. An electronic device comprising a memory and a processor coupled to each other, the memory having stored therein program instructions, the processor being configured to execute the program instructions to implement the speech recognition method of any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that program instructions executable by a processor for implementing the speech recognition method of any one of claims 1 to 9 are stored.