CN110808035B - Method and apparatus for training hybrid language recognition models


Info

Publication number: CN110808035B
Authority: CN (China)
Prior art keywords: sequence, syllable, language, syllable label, label
Legal status: Active
Application number: CN201911075308.1A
Other languages: Chinese (zh)
Other versions: CN110808035A
Inventor: 袁胜龙
Current Assignee: Baidu Online Network Technology Beijing Co Ltd
Original Assignee: Baidu Online Network Technology Beijing Co Ltd
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to CN201911075308.1A
Publication of CN110808035A
Application granted
Publication of CN110808035B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a method and a device for training a hybrid language recognition model. One embodiment of the method comprises: generating a first syllable tag sequence of the first language audio and a second syllable tag sequence of the second language audio; processing the second language audio and the second syllable label sequence by using a pre-trained first language identification model to obtain a connection timing classification (connectionist temporal classification, CTC) Viterbi sequence; determining a connection timing classification Viterbi score for each second syllable label in the second syllable label sequence based on the connection timing classification Viterbi sequence; determining difference syllable labels from the second syllable label sequence based on the determined connection timing classification Viterbi scores; and performing mixed training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language identification model. The embodiment enables the same model to support the recognition of multiple languages.

Description

Method and apparatus for training hybrid language recognition models
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for training a hybrid language recognition model.
Background
With the development of voice recognition technology, the performance of voice recognition has become satisfactory for practical use; for example, various input methods on mobile phones have a voice interaction function. In practical applications, there is speech recognition in dialect scenarios in addition to speech recognition in Mandarin scenarios. At present, many voice interaction products support dialect speech recognition: for example, the speech recognition options of a mobile phone input method let a user select a corresponding dialect as needed, and some smart televisions, smart refrigerators and the like are made for a specific dialect.
In the related art, a Mandarin recognition model is usually adopted to perform speech recognition on Mandarin, and a corresponding dialect recognition model is adopted to perform speech recognition on each dialect; when a user switches languages, the user needs to select the corresponding speech recognition model back and forth, which is tedious to operate. Moreover, as more and more dialects are supported, more and more dialect recognition models need to be trained, so the workload of model training grows.
Disclosure of Invention
The embodiment of the application provides a method and a device for training a hybrid language recognition model.
In a first aspect, an embodiment of the present application provides a method for training a hybrid language recognition model, including: generating a first syllable tag sequence of the first language audio and a second syllable tag sequence of the second language audio; processing the second language audio and the second syllable label sequence by using a pre-trained first language identification model to obtain a connection timing classification Viterbi sequence, wherein the first language identification model is obtained by training based on the first syllable label sequence; determining a connection timing classification Viterbi score for each second syllable label in the second syllable label sequence based on the connection timing classification Viterbi sequence; determining difference syllable labels from the second syllable label sequence based on the determined connection timing classification Viterbi scores; and performing mixed training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language identification model.
In some embodiments, generating a first syllable tag sequence for the first language audio comprises: extracting mel frequency cepstrum coefficient characteristics of the first language audio; and training the Gaussian mixture model based on the Mel frequency cepstrum coefficient characteristics and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and a first syllable label sequence.
In some embodiments, generating a second syllable tag sequence for the second language audio comprises: and inputting the second language audio into the aligned Gaussian mixture model to obtain a second syllable label sequence, wherein the label number of the second syllable label sequence is equal to that of the first syllable label sequence.
In some embodiments, processing the second language audio and the second syllable label sequence using the pre-trained first language identification model to obtain a connection timing classification Viterbi sequence comprises: de-duplicating the second syllable label sequence to obtain a de-duplicated syllable label sequence; removing mute frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence; inserting spaces into the valid syllable label sequence to obtain a space-inserted syllable label sequence; and inputting the second language audio and the space-inserted syllable label sequence into the first language identification model to obtain the connection timing classification Viterbi sequence.
In some embodiments, determining the connection timing classification Viterbi score for each second syllable label in the second syllable label sequence based on the connection timing classification Viterbi sequence comprises: for each valid syllable label in the valid syllable label sequence, determining the connection timing classification Viterbi score of the valid syllable label based on the position of the valid syllable label in the connection timing classification Viterbi sequence.
In some embodiments, determining the difference syllable labels from the second syllable label sequence based on the determined connection timing classification Viterbi scores comprises: determining the valid syllable labels in the valid syllable label sequence whose connection timing classification Viterbi scores are smaller than a preset threshold value as the difference syllable labels.
In some embodiments, the first language identification model is trained by: extracting the filter bank coefficient characteristics of the first language audio; and training the deep neural network based on the filter bank coefficient characteristics and the first syllable label sequence to obtain a first language identification model, wherein the node number of an output layer of the first language identification model is equal to the label number of the first syllable label sequence.
In some embodiments, the hybrid training of the deep neural network based on the first syllable label sequence and the difference syllable labels results in a hybrid language recognition model, comprising: and performing mixed training on the deep neural network based on the filter bank coefficient characteristics, the first syllable label sequence and the difference syllable labels to obtain a mixed language identification model, wherein the node number of an output layer of the mixed language identification model is equal to the sum of the label number of the first syllable label sequence and the label number of the difference syllable labels.
In some embodiments, training the first or hybrid language recognition model optimizes the deep neural network using training criteria based on connection timing classification.
In a second aspect, an embodiment of the present application provides an apparatus for training a hybrid language recognition model, including: a generating unit configured to generate a first syllable tag sequence of a first language audio and a second syllable tag sequence of a second language audio; a processing unit configured to process the second language audio and the second syllable label sequence by using a pre-trained first language identification model to obtain a connection timing classification Viterbi sequence, wherein the first language identification model is obtained by training based on the first syllable label sequence; a first determining unit configured to determine a connection timing classification Viterbi score for each second syllable label in the second syllable label sequence based on the connection timing classification Viterbi sequence; a second determining unit configured to determine difference syllable labels from the second syllable label sequence based on the determined connection timing classification Viterbi scores; and a training unit configured to perform mixed training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model.
In some embodiments, the generating unit is further configured to: extracting mel frequency cepstrum coefficient characteristics of the first language audio; and training the Gaussian mixture model based on the Mel frequency cepstrum coefficient characteristics and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and a first syllable label sequence.
In some embodiments, the generating unit is further configured to: and inputting the second language audio into the aligned Gaussian mixture model to obtain a second syllable label sequence, wherein the label number of the second syllable label sequence is equal to that of the first syllable label sequence.
In some embodiments, the processing unit is further configured to: de-duplicate the second syllable label sequence to obtain a de-duplicated syllable label sequence; remove mute frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence; insert spaces into the valid syllable label sequence to obtain a space-inserted syllable label sequence; and input the second language audio and the space-inserted syllable label sequence into the first language identification model to obtain the connection timing classification Viterbi sequence.
In some embodiments, the first determination unit is further configured to: for each valid syllable label in the valid syllable label sequence, determine the connection timing classification Viterbi score of the valid syllable label based on the position of the valid syllable label in the connection timing classification Viterbi sequence.
In some embodiments, the second determination unit is further configured to: determine the valid syllable labels in the valid syllable label sequence whose connection timing classification Viterbi scores are smaller than a preset threshold value as the difference syllable labels.
In some embodiments, the first language identification model is trained by: extracting the filter bank coefficient characteristics of the first language audio; and training the deep neural network based on the filter bank coefficient characteristics and the first syllable label sequence to obtain a first language identification model, wherein the node number of an output layer of the first language identification model is equal to the label number of the first syllable label sequence.
In some embodiments, the training unit is further configured to: and performing mixed training on the deep neural network based on the filter bank coefficient characteristics, the first syllable label sequence and the difference syllable labels to obtain a mixed language identification model, wherein the node number of an output layer of the mixed language identification model is equal to the sum of the label number of the first syllable label sequence and the label number of the difference syllable labels.
In some embodiments, training the first or hybrid language recognition model optimizes the deep neural network using training criteria based on connection timing classification.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
According to the method and the device for training the hybrid language recognition model, first, a first syllable label sequence of the first language audio and a second syllable label sequence of the second language audio are generated; then, a first language identification model pre-trained based on the first syllable label sequence is used to process the second language audio and the second syllable label sequence to obtain a connection timing classification Viterbi sequence; next, based on the connection timing classification Viterbi sequence, the connection timing classification Viterbi score of each second syllable label in the second syllable label sequence is determined; then, based on the determined connection timing classification Viterbi scores, difference syllable labels are determined from the second syllable label sequence; and finally, mixed training is performed on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model. Because the difference syllable labels are determined from the second syllable label sequence based on the connection timing classification Viterbi scores, and the hybrid language recognition model is trained based on the first syllable label sequence and only the difference syllable labels, the model training workload is reduced. In addition, the hybrid language recognition model is trained on syllable labels of multiple languages, so the same model supports recognition of multiple languages. Moreover, the user does not need to switch among multiple models, which simplifies user operation.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture to which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for training a hybrid language recognition model according to the present application;
FIG. 3 is a schematic diagram of the structure of a hybrid language recognition model;
FIG. 4 is a flow diagram of yet another embodiment of a method for training a hybrid language recognition model according to the present application;
FIG. 5 is a schematic diagram of connection timing classification Viterbi decoding using the first language identification model;
FIG. 6 is a schematic diagram illustrating an architecture of one embodiment of an apparatus for training a hybrid language recognition model according to the present application;
FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the present method for training a hybrid language recognition model or an apparatus for training a hybrid language recognition model may be applied.
As shown in fig. 1, a system architecture 100 may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as a voice recognition application, etc., may be installed on the terminal device 101.
The terminal apparatus 101 may be hardware or software. When the terminal apparatus 101 is hardware, it may be various electronic apparatuses. Including but not limited to smart phones, tablets, smart speakers, smart televisions, smart refrigerators, and the like. When the terminal apparatus 101 is software, it can be installed in the above-described electronic apparatus. It may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. And is not particularly limited herein.
The server 103 may provide various services. For example, the server 103 may analyze and process data such as the first language audio and the second language audio acquired from the terminal apparatus 101 and generate a processing result (e.g., a hybrid language identification model).
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for training the hybrid language recognition model provided in the embodiment of the present application is generally performed by the server 103, and accordingly, the apparatus for training the hybrid language recognition model is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for training a hybrid language recognition model in accordance with the present application is illustrated. The method for training the hybrid language recognition model comprises the following steps:
Step 201, a first syllable label sequence of the first language audio and a second syllable label sequence of the second language audio are generated.
In this embodiment, an executing agent (e.g., the server 103 shown in fig. 1) of the method for training a hybrid language recognition model may generate a first syllable tag sequence of the first language audio and a second syllable tag sequence of the second language audio.
Typically, the first language and the second language may be different classes of languages, for example the first language may be Mandarin and the second language may be one or more dialects. Further, the first language audio and the second language audio correspond to the same text. That is, the first language audio is the same content as the second language audio. The execution body may perform syllable segmentation on the first language audio and the second language audio respectively to generate a first syllable label sequence and a second syllable label sequence. The first language audio and the second language audio may be segmented by the same syllable segmentation method after being aligned in frame level, so that the first syllable label sequence and the second syllable label sequence have the same number of label types. For example, if the first syllable tag sequence includes 2000 tags, then the second syllable tag sequence also includes 2000 tags.
Step 202, processing the second language audio and the second syllable label sequence by using a pre-trained first language identification model to obtain a connection timing classification Viterbi sequence.
In this embodiment, the executing entity may process the second language audio and the second syllable label sequence by using a pre-trained first language identification model, so as to obtain a connection timing classification Viterbi sequence. For example, the execution body may directly input the second language audio and the second syllable label sequence into the first language identification model to obtain the connection timing classification Viterbi sequence. For another example, the execution body may first process the second syllable label sequence, and then input the second language audio and the processed second syllable label sequence into the first language identification model to obtain the connection timing classification Viterbi sequence. Here, "connection timing classification" refers to connectionist temporal classification (CTC), so the connection timing classification Viterbi sequence is a CTC Viterbi sequence.
In this embodiment, the first language identification model may be trained based on the first syllable label sequence and is used for identifying the first language. Generally, the execution body may first extract filter bank (Filter Bank) coefficient features of the first language audio, and then train a deep neural network based on the filter bank coefficient features and the first syllable label sequence to obtain the first language identification model. The number of nodes of the output layer of the first language identification model may be equal to the number of label classes of the first syllable label sequence. For example, if the first syllable label sequence includes 2000 kinds of labels, the output layer of the first language identification model may include 2000 nodes. The deep neural network used to train the first language recognition model may be, for example, an LSTM (Long Short-Term Memory) network trained with the CTC criterion.
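As a concrete illustration, the following minimal sketch trains such a CTC-based LSTM acoustic model in PyTorch. The sketch is an assumption of this description rather than part of the patent: the feature dimension, network sizes, batch shapes and the 2000-label inventory are illustrative only.

import torch
import torch.nn as nn

NUM_LABELS = 2000   # label classes of the first syllable label sequence (example value)
FEAT_DIM = 80       # filter bank coefficient features per frame (assumed)

class FirstLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, 512, num_layers=3, batch_first=True)
        # one output node per syllable label, plus one extra node for the CTC blank
        self.out = nn.Linear(512, NUM_LABELS + 1)

    def forward(self, feats):                    # feats: (batch, frames, FEAT_DIM)
        hidden, _ = self.lstm(feats)
        return self.out(hidden).log_softmax(dim=-1)

model = FirstLanguageModel()
ctc_loss = nn.CTCLoss(blank=NUM_LABELS)          # training criterion based on CTC

feats = torch.randn(4, 200, FEAT_DIM)            # dummy batch: 4 utterances, 200 frames
targets = torch.randint(0, NUM_LABELS, (4, 30))  # dummy first syllable label sequences
log_probs = model(feats).transpose(0, 1)         # CTCLoss expects (frames, batch, labels)
loss = ctc_loss(log_probs, targets,
                torch.full((4,), 200, dtype=torch.long),
                torch.full((4,), 30, dtype=torch.long))
loss.backward()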
In some optional implementations of the embodiment, the training of the first language recognition model may optimize the deep neural network using a training criterion based on connection timing classification. The training criterion based on the connection timing classification can be shown as the following formula:
∂L/∂a_k = y_k - (1 / p(z|x)) · Σ_{s ∈ label(z,k)} α(s)β(s)

wherein a_k is the unactivated network output for label k, y_k is the score of label k at a given moment, s is a state, x is the input feature sequence, z is the CTC path at time t, Σ_{s ∈ label(z,k)} α(s)β(s) is the score belonging to label k on a CTC path at a given moment (obtained by multiplying the CTC forward score α(s) by the backward score β(s)), and p(z|x) is the total score of the paths traversed by the CTC at that moment.
Step 203, determining a connection timing classification Viterbi score of each second syllable label in the second syllable label sequence based on the connection timing classification Viterbi sequence.
In this embodiment, the execution main body may determine the connection timing classification Viterbi score of each second syllable label in the second syllable label sequence based on the connection timing classification Viterbi sequence. Specifically, for each second syllable label in the second syllable label sequence, the executing entity may search for the second syllable label in the connection timing classification Viterbi sequence and determine its connection timing classification Viterbi score based on the position of the second syllable label in the connection timing classification Viterbi sequence. In general, the connection timing classification Viterbi score of a second syllable label is related to its position in the connection timing classification Viterbi sequence and varies from position to position. The connection timing classification Viterbi score takes a value between 0 and 1 and reflects, to a certain extent, the similarity to the corresponding first syllable label: the higher the score, the closer the second syllable label is to the corresponding first syllable label.
In addition, when the same second syllable label appears multiple times in the connection timing classification Viterbi sequence, the executing body may first determine a plurality of connection timing classification Viterbi scores based on the plurality of positions where the second syllable label appears in the connection timing classification Viterbi sequence, and then calculate their average as the connection timing classification Viterbi score of the second syllable label.
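A minimal sketch of this scoring step, assuming the Viterbi pass has already produced one (label, score) pair per frame; the "<blank>" marker and the data layout are illustrative assumptions, not the patent's API:

from collections import defaultdict

def label_scores(viterbi_sequence, blank="<blank>"):
    """Average the connection timing classification Viterbi scores of each
    second syllable label over every position at which it appears."""
    per_label = defaultdict(list)
    for label, score in viterbi_sequence:   # one (label, score) pair per frame
        if label != blank:                  # spaces carry no syllable identity
            per_label[label].append(score)
    return {label: sum(s) / len(s) for label, s in per_label.items()}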
Step 204, difference syllable labels are determined from the second syllable label sequence based on the determined connection timing classification Viterbi scores.
In this embodiment, the executing body may determine the difference syllable labels from the second syllable label sequence based on the determined connection timing classification Viterbi scores. For example, the execution body may sort all the second syllable labels in the second syllable label sequence from high to low according to their connection timing classification Viterbi scores and select a predetermined number (e.g., 400) of second syllable labels from the low-score end; the selected second syllable labels are the difference syllable labels.
It should be understood that the difference syllable labels are pronounced quite differently from the first language while their corresponding text content is the same as in the first language, so after the difference syllable labels are determined, the execution body will usually add a corresponding alternative pronunciation (a polyphone entry) for the corresponding text in the decoding dictionary.
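A minimal sketch of the selection rule described above (sort by score and take a predetermined number from the low-score end); the function name and the default of 400 are illustrative assumptions:

def select_difference_labels(scores, num_difference=400):
    """scores: dict mapping each second syllable label to its connection
    timing classification Viterbi score. The lowest-scoring labels are the
    ones whose pronunciation differs most from the first language."""
    ranked = sorted(scores.items(), key=lambda item: item[1])   # ascending
    return [label for label, _ in ranked[:num_difference]]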
Step 205, performing hybrid training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language identification model.
In this embodiment, the executing entity may perform hybrid training on the deep neural network based on the first syllable label sequence and the different syllable labels to obtain a hybrid language identification model. The hybrid language identification model can identify the first language and the second language. That is, the hybrid language recognition model enables recognition of multiple languages supported by the same model.
In general, the execution body may perform hybrid training on the deep neural network based on the filter bank coefficient characteristics, the first syllable tag sequence, and the difference syllable tags to obtain the hybrid language identification model, where the number of nodes of the output layer of the hybrid language identification model may be equal to the sum of the number of tag types of the first syllable tag sequence and the number of tag types of the difference syllable tags. For example, if the first syllable label sequence includes 2000 kinds of labels and the difference syllable labels include 400 kinds of labels, the output layer of the hybrid language identification model may include 2400 nodes. When the hybrid language recognition model is trained, a training criterion based on connection timing classification can be adopted to optimize the deep neural network. It should be noted that the step of training the hybrid language identification model is similar to the step of training the first language identification model, and is not repeated here.
For ease of understanding, FIG. 3 shows a schematic structural diagram of the hybrid language recognition model. As shown in fig. 3, the hybrid language recognition model may include an input layer, a hidden layer, and an output layer. Taking the mixed recognition of Mandarin, dialect A and dialect B as an example, each of Mandarin, dialect A and dialect B has 2000 syllable labels. For dialect A, 400 of its 2000 syllable labels differ from Mandarin, so these 400 difference syllable labels of dialect A serve as independent modeling units, while its other 1600 syllable labels share modeling units with Mandarin. Similarly, for dialect B, 600 of its 2000 syllable labels differ greatly from Mandarin, so these 600 difference syllable labels of dialect B serve as independent modeling units, while its other 1400 syllable labels share modeling units with Mandarin. Thus, a hybrid language identification model trained based on the syllable labels of Mandarin, the difference syllable labels of dialect A, and the difference syllable labels of dialect B has an output layer of 3000 nodes: the 2000 syllable labels of Mandarin correspond to 2000 nodes; the 400 difference syllable labels of dialect A correspond to 400 independent nodes, and its other 1600 syllable labels share nodes with Mandarin; likewise, the 600 difference syllable labels of dialect B correspond to 600 independent nodes, and its other 1400 syllable labels share nodes with Mandarin.
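The node bookkeeping of FIG. 3 can be sketched as follows; this is an illustration of the counting argument only, with hypothetical label names:

def build_output_index(mandarin_labels, diff_a_labels, diff_b_labels):
    """Shared syllable labels reuse Mandarin's output nodes; only the
    difference labels of each dialect get independent nodes appended."""
    index = {("shared", lab): i for i, lab in enumerate(mandarin_labels)}
    for lang, labels in (("dialect_a", diff_a_labels),
                         ("dialect_b", diff_b_labels)):
        for lab in labels:
            index[(lang, lab)] = len(index)   # next free output node
    return index

mandarin = [f"syl{i}" for i in range(2000)]
diff_a = [f"syl{i}" for i in range(400)]      # difference labels of dialect A
diff_b = [f"syl{i}" for i in range(600)]      # difference labels of dialect B
assert len(build_output_index(mandarin, diff_a, diff_b)) == 3000  # 2000+400+600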
The method for training the hybrid language identification model provided by the embodiment of the application first generates a first syllable label sequence of the first language audio and a second syllable label sequence of the second language audio; then a first language identification model pre-trained based on the first syllable label sequence is used to process the second language audio and the second syllable label sequence to obtain a connection timing classification Viterbi sequence; next, based on the connection timing classification Viterbi sequence, the connection timing classification Viterbi score of each second syllable label in the second syllable label sequence is determined; then, based on the determined connection timing classification Viterbi scores, difference syllable labels are determined from the second syllable label sequence; and finally, mixed training is performed on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain the hybrid language recognition model. Because the difference syllable labels are determined from the second syllable label sequence based on the connection timing classification Viterbi scores, and the hybrid language recognition model is trained based on the first syllable label sequence and only the difference syllable labels, the model training workload is reduced. In addition, the hybrid language recognition model is trained on syllable labels of multiple languages, so the same model supports recognition of multiple languages. Moreover, the user does not need to switch among multiple models, which simplifies user operation.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for training a hybrid language recognition model in accordance with the present application is illustrated. The method for training the hybrid language recognition model comprises the following steps:
Step 401, extracting mel-frequency cepstrum coefficient features of the first language audio.
In this embodiment, an executive (e.g., server 103 shown in fig. 1) of a method for training a hybrid language recognition model may extract Mel Frequency Cepstral Coefficients (MFCCs) features of audio in a first language. Wherein the first language may be mandarin chinese, for example.
Step 402, training a Gaussian mixture model based on the Mel frequency cepstrum coefficient characteristics and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and a first syllable label sequence.
In this embodiment, the executing body may train a Gaussian Mixture Model (GMM) based on mel-frequency cepstral coefficient features and a text corresponding to the first language audio, so as to obtain an aligned Gaussian mixture model and a first syllable label sequence. Generally, a gaussian mixture model can be trained based on mel-frequency cepstral coefficient features and text corresponding to the first language audio, and is used for aligning the audio at a frame level, so that the model is called an aligned gaussian mixture model. And the first syllable label sequence can be obtained by the text corresponding to the Mel frequency cepstrum coefficient characteristic and the first language audio through the aligned Gaussian mixture model.
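The MFCC extraction of step 401 can be sketched as follows (the file name, sampling rate and coefficient count are illustrative assumptions; the aligned Gaussian mixture model itself is typically trained with an HMM-GMM forced-alignment toolkit and is not sketched here):

import librosa

audio, sr = librosa.load("first_language_utterance.wav", sr=16000)  # mono, 16 kHz
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape: (13, n_frames)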
Step 403, inputting the second language audio into the aligned gaussian mixture model to obtain a second syllable label sequence.
In this embodiment, the execution body may input the second language audio into the aligned Gaussian mixture model to obtain the second syllable label sequence. Here, aligning the second language audio using the aligned Gaussian mixture model ensures that the syllable labels of the first language audio and the second language audio are strictly consistent; thus, the number of label types of the second syllable label sequence obtained with the aligned Gaussian mixture model equals the number of label types of the first syllable label sequence. The second language is of a different category from the first language; for example, the first language is Mandarin and the second language is one or more dialects.
Step 404, the second syllable label sequence is de-duplicated to obtain a de-duplicated syllable label sequence.
In this embodiment, the execution body may de-duplicate the second syllable label sequence to obtain a de-duplicated syllable label sequence. For example, if the second syllable label sequence is "0000C00A0B0000", the de-duplicated syllable label sequence obtained by the de-duplication process may be "0C0A0B0", where "0" is a mute frame.
Step 405, mute frames are removed from the de-duplicated syllable label sequence to obtain a valid syllable label sequence.
In this embodiment, the execution body may remove the mute frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence. For example, if the de-duplicated syllable label sequence is "0C0A0B0", removing the mute frames yields the valid syllable label sequence "C A B", where "0" is a mute frame.
Step 406, spaces are inserted into the valid syllable label sequence to obtain a space-inserted syllable label sequence.
In this embodiment, the execution body may insert spaces into the valid syllable label sequence to obtain a space-inserted syllable label sequence. Typically, a space is inserted between every two adjacent valid syllable labels in the valid syllable label sequence, and a space is also inserted at the beginning and at the end of the sequence. For example, if the valid syllable label sequence is "C A B", inserting spaces yields the space-inserted syllable label sequence "⌀ C ⌀ A ⌀ B ⌀", where "⌀" denotes the space (blank) symbol.
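Steps 404 to 406 can be sketched as one small preprocessing function; the "0" mute-frame label follows the running example, and the blank marker is an illustrative assumption:

SILENCE = "0"       # mute-frame label in the running example
BLANK = "<b>"       # the space symbol inserted between labels

def prepare_label_sequence(frame_labels):
    # step 404: de-duplicate, collapsing runs of identical labels
    dedup = [lab for i, lab in enumerate(frame_labels)
             if i == 0 or lab != frame_labels[i - 1]]
    # step 405: remove mute frames to keep only valid syllable labels
    valid = [lab for lab in dedup if lab != SILENCE]
    # step 406: insert a space between labels and at both ends
    spaced = [BLANK]
    for lab in valid:
        spaced += [lab, BLANK]
    return spaced

# "0000C00A0B0000" -> "0C0A0B0" -> "C A B" -> "<b> C <b> A <b> B <b>"
print(prepare_label_sequence(list("0000C00A0B0000")))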
Step 407, the second language audio and the space-inserted syllable label sequence are input into the first language identification model to obtain a connection timing classification Viterbi sequence.
In this embodiment, the executing entity may input the second language audio and the space-inserted syllable label sequence into the first language identification model to obtain the connection timing classification Viterbi sequence.
For ease of understanding, fig. 5 shows a schematic diagram of connection timing classification Viterbi decoding using the first language identification model. As shown in fig. 5, take the space-inserted syllable label sequence "⌀ C ⌀ A ⌀ B ⌀" as an example: the spaces in the space-inserted syllable label sequence are represented by open circles, and the valid syllable labels are represented by filled circles. The N labels of the space-inserted syllable label sequence are arranged vertically, and T represents the number of frames of the second language audio. The method for performing connection timing classification Viterbi decoding with the first language identification model comprises the following steps:
First, at the initial time (t = 1) a path can start only from state 1 (a space) or state 2 (the first valid syllable label):

α_1(1) = y_1(⌀),  α_1(2) = y_1(l_1),  α_1(s) = 0 for s > 2

wherein α_1(s) represents the total score of all paths passing through state s at the initial time (t = 1), y_1(⌀) represents the score with which the network (the first language identification model) outputs a space at the initial time, and y_1(l_1) represents the score with which the network outputs the first valid syllable label at the initial time.
Secondly, iteration:

α_t(s) = (α_{t-1}(s) + α_{t-1}(s-1)) · y_t(l_s),                     if l_s is a space or l_s = l_{s-2}
α_t(s) = (α_{t-1}(s) + α_{t-1}(s-1) + α_{t-1}(s-2)) · y_t(l_s),      otherwise

wherein α_t(s) represents the total score of all paths through state s at time t, obtained by iteration, and y_t(l_s) represents the score with which the network (the first language identification model) outputs l_s at time t; l_s is either a space or a valid syllable label.
Finally, the path with the highest total score among the paths satisfying the above formulas is selected to obtain the connection timing classification Viterbi sequence.
The paths satisfying the above formulas are shown by the arrows in fig. 5, and there are multiple such paths. The connection timing classification Viterbi sequence is the path whose total score is the highest. Assuming that T is 20, the obtained connection timing classification Viterbi sequence is a sequence of 20 labels drawn from the space symbol and the valid syllable labels C, A and B (the concrete sequence is shown in fig. 5).
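A minimal sketch of the recursion above, using the max over predecessor states (the Viterbi best path) in the log domain; the score matrix layout and function signature are illustrative assumptions:

import numpy as np

def ctc_viterbi(y, spaced_labels, blank=0):
    """y: (T, K) per-frame label scores from the first language model;
    spaced_labels: label ids of the space-inserted sequence, length N.
    Returns the best-path label for each of the T frames."""
    T, N = y.shape[0], len(spaced_labels)
    alpha = np.full((T, N), -np.inf)            # log-domain path scores
    back = np.zeros((T, N), dtype=int)          # best predecessor state
    logy = np.log(y + 1e-30)

    # initialization: start in state 0 (a space) or state 1 (first label)
    alpha[0, 0] = logy[0, spaced_labels[0]]
    if N > 1:
        alpha[0, 1] = logy[0, spaced_labels[1]]

    for t in range(1, T):
        for s in range(N):
            cands = [s]                          # stay in the same state
            if s >= 1:
                cands.append(s - 1)              # advance one state
            if (s >= 2 and spaced_labels[s] != blank
                    and spaced_labels[s] != spaced_labels[s - 2]):
                cands.append(s - 2)              # skip a space between labels
            best = max(cands, key=lambda p: alpha[t - 1, p])
            back[t, s] = best
            alpha[t, s] = alpha[t - 1, best] + logy[t, spaced_labels[s]]

    # the path must end in the last space or the last valid syllable label
    s = max(range(max(N - 2, 0), N), key=lambda p: alpha[T - 1, p])
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    path.reverse()
    return [spaced_labels[p] for p in path]

Applied to the running example, ctc_viterbi would return a frame-level label sequence such as the length-20 sequence discussed above.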
Step 408, for each valid syllable label in the valid syllable label sequence, a connection timing classification Viterbi score of the valid syllable label is determined based on the position of the valid syllable label in the connection timing classification Viterbi sequence.
In this embodiment, for each valid syllable label in the valid syllable label sequence, the execution body may search for the valid syllable label in the connection timing classification Viterbi sequence and determine its connection timing classification Viterbi score based on the position of the valid syllable label in the connection timing classification Viterbi sequence. In general, the connection timing classification Viterbi score of a valid syllable label is related to its position in the connection timing classification Viterbi sequence and varies from position to position. The connection timing classification Viterbi score takes a value between 0 and 1 and reflects, to a certain extent, the similarity to the corresponding first syllable label: the higher the score, the closer the valid syllable label is to the corresponding first syllable label.
In addition, when the same valid syllable label appears multiple times in the connection timing classification Viterbi sequence, the execution main body may first determine a plurality of connection timing classification Viterbi scores based on the plurality of positions where the valid syllable label appears in the connection timing classification Viterbi sequence, and then calculate their average as the connection timing classification Viterbi score of the valid syllable label.
Step 409, the valid syllable labels in the valid syllable label sequence whose connection timing classification Viterbi scores are smaller than the preset threshold are determined as difference syllable labels.
In this embodiment, the execution body may determine the valid syllable labels in the valid syllable label sequence whose connection timing classification Viterbi scores are smaller than a preset threshold (e.g., 0.2) as the difference syllable labels. For example, the execution body may sort all the valid syllable labels in the valid syllable label sequence in ascending order of connection timing classification Viterbi score, compare each score with the preset threshold starting from the lowest-scoring label, stop as soon as a valid syllable label whose score is not less than the preset threshold appears, and determine the valid syllable labels compared before it as the difference syllable labels.
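A minimal sketch of this thresholded variant (the 0.2 default mirrors the example value; names are illustrative):

def difference_labels_by_threshold(scores, threshold=0.2):
    """Walk the valid syllable labels from the lowest connection timing
    classification Viterbi score upward and stop at the first label whose
    score reaches the preset threshold."""
    difference = []
    for label, score in sorted(scores.items(), key=lambda item: item[1]):
        if score >= threshold:
            break
        difference.append(label)
    return difference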
Step 410, performing hybrid training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language identification model.
In this embodiment, the specific operation of step 410 has been described in detail in step 205 in the embodiment shown in fig. 2, and is not described herein again.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for training the hybrid language identification model in this embodiment highlights the step of generating the syllable labels and the step of generating the connection timing classification Viterbi sequence. The scheme described in this embodiment processes the second language audio with the aligned Gaussian mixture model trained on the first language data, which ensures that the first syllable tag sequence of the first language audio and the second syllable tag sequence of the second language audio are strictly consistent. In addition, the connection timing classification Viterbi decoding is performed only after the second syllable label sequence has been de-duplicated, stripped of mute frames and space-inserted, so the generated connection timing classification Viterbi sequence better reflects the connection timing classification Viterbi score of each valid syllable label.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for training a hybrid language recognition model, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the apparatus 600 for training a hybrid language recognition model of the present embodiment may include: a generating unit 601, a processing unit 602, a first determining unit 603, a second determining unit 604 and a training unit 605. Wherein the generating unit 601 is configured to generate a first syllable tag sequence of a first language audio and a second syllable tag sequence of a second language audio; a processing unit 602 configured to process the second language audio and the second syllable label sequence by using a pre-trained first language identification model, and obtain a connection timing classification viterbi sequence, wherein the first language identification model is obtained by training based on the first syllable label sequence; a first determining unit 603 configured to determine a connection timing classification viterbi score for each second syllable label in the second syllable label sequence based on the connection timing classification viterbi sequence; a second determining unit 604 configured to determine a difference syllable label from the second syllable label sequence based on the determined connection timing classification viterbi score; a training unit 605 configured to perform hybrid training on the deep neural network based on the first syllable label sequence and the difference syllable labels, resulting in a hybrid language identification model.
In the present embodiment, in the apparatus 600 for training a hybrid language recognition model: the detailed processing of the generating unit 601, the processing unit 602, the first determining unit 603, the second determining unit 604 and the training unit 605 and the technical effects thereof can refer to the related descriptions of step 201 and step 205 in the corresponding embodiment of fig. 2, and are not described herein again.
In some optional implementations of this embodiment, the generating unit 601 is further configured to: extracting mel frequency cepstrum coefficient characteristics of the first language audio; and training the Gaussian mixture model based on the Mel frequency cepstrum coefficient characteristics and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and a first syllable label sequence.
In some optional implementations of this embodiment, the generating unit 601 is further configured to: and inputting the second language audio into the aligned Gaussian mixture model to obtain a second syllable label sequence, wherein the label number of the second syllable label sequence is equal to that of the first syllable label sequence.
In some optional implementations of this embodiment, the processing unit 602 is further configured to: de-duplicate the second syllable label sequence to obtain a de-duplicated syllable label sequence; remove mute frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence; insert spaces into the valid syllable label sequence to obtain a space-inserted syllable label sequence; and input the second language audio and the space-inserted syllable label sequence into the first language identification model to obtain the connection timing classification Viterbi sequence.
In some optional implementations of this embodiment, the first determining unit 603 is further configured to: for each valid syllable label in the valid syllable label sequence, determine the connection timing classification Viterbi score of the valid syllable label based on the position of the valid syllable label in the connection timing classification Viterbi sequence.
In some optional implementations of this embodiment, the second determining unit 604 is further configured to: determine the valid syllable labels in the valid syllable label sequence whose connection timing classification Viterbi scores are smaller than a preset threshold value as the difference syllable labels.
In some optional implementations of this embodiment, the first language identification model is trained by: extracting the filter bank coefficient characteristics of the first language audio; and training the deep neural network based on the filter bank coefficient characteristics and the first syllable label sequence to obtain a first language identification model, wherein the node number of an output layer of the first language identification model is equal to the label number of the first syllable label sequence.
In some optional implementations of this embodiment, the training unit 605 is further configured to: and performing mixed training on the deep neural network based on the filter bank coefficient characteristics, the first syllable label sequence and the difference syllable labels to obtain a mixed language identification model, wherein the node number of an output layer of the mixed language identification model is equal to the sum of the label number of the first syllable label sequence and the label number of the difference syllable labels.
In some optional implementations of the embodiment, the training of the first language recognition model or the hybrid language recognition model optimizes the deep neural network using a training criterion based on connection timing classification.
Referring now to FIG. 7, a block diagram of a computer system 700 suitable for use in implementing an electronic device (e.g., server 103 shown in FIG. 1) of an embodiment of the present application is shown. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by a Central Processing Unit (CPU)701, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or electronic device. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor including a generation unit, a processing unit, a first determination unit, a second determination unit, and a training unit. The names of these units do not limit the units themselves in this case; for example, the generating unit may also be described as a "unit that generates a first syllable label sequence of a first language audio and a second syllable label sequence of a second language audio".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: generate a first syllable label sequence of the first language audio and a second syllable label sequence of the second language audio; process the second language audio and the second syllable label sequence with a pre-trained first language recognition model to obtain a connectionist temporal classification (CTC) Viterbi sequence, wherein the first language recognition model is trained based on the first syllable label sequence; determine a CTC Viterbi score for each second syllable label in the second syllable label sequence based on the CTC Viterbi sequence; determine difference syllable labels from the second syllable label sequence based on the determined CTC Viterbi scores; and perform hybrid training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (20)

1. A method for training a hybrid language recognition model, comprising:
generating a first syllable label sequence of the first language audio and a second syllable label sequence of the second language audio;
processing the second language audio and the second syllable label sequence using a pre-trained first language recognition model to obtain a connectionist temporal classification (CTC) Viterbi sequence, wherein the first language recognition model is trained based on the first syllable label sequence;
determining a CTC Viterbi score for each second syllable label in the second syllable label sequence based on the CTC Viterbi sequence;
determining difference syllable labels from the second syllable label sequence based on the determined CTC Viterbi scores;
and performing hybrid training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model.
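For illustration only (this sketch is not part of the claims), the following minimal Python example walks the five steps of claim 1 on toy data. The posterior matrix, the per-label scoring rule, and the threshold value are all assumptions made for the example rather than details fixed by the patent.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setting: 10 first-language syllable labels (ids 0-9); the second
    # language utterance has already been frame-labelled by the aligned GMM.
    num_first_labels = 10
    second_labels = np.array([3, 3, 7, 7, 7, 1, 9, 9])

    # Assumed stand-in for the first-language model's per-frame posteriors.
    posteriors = rng.dirichlet(np.ones(num_first_labels),
                               size=len(second_labels))

    # Crude per-label score: mean posterior of the aligned label over its
    # frames (the patent's exact Viterbi scoring is not reproduced here).
    scores = {int(lab): float(posteriors[second_labels == lab, lab].mean())
              for lab in np.unique(second_labels)}

    # Labels the first-language model covers poorly become difference labels.
    threshold = 0.15
    diff_labels = [lab for lab, s in scores.items() if s < threshold]

    # The hybrid model's output layer covers the union of both label sets.
    mixed_output_size = num_first_labels + len(diff_labels)
    print(scores, diff_labels, mixed_output_size)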
2. The method of claim 1, wherein generating the first syllable label sequence of the first language audio comprises:
extracting Mel-frequency cepstral coefficient (MFCC) features of the first language audio;
and training a Gaussian mixture model based on the MFCC features and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and the first syllable label sequence.
3. The method of claim 2, wherein generating the second syllable label sequence of the second language audio comprises:
inputting the second language audio into the aligned Gaussian mixture model to obtain the second syllable label sequence, wherein the number of labels in the second syllable label sequence is equal to the number of labels in the first syllable label sequence.
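As a rough illustration of claims 2 and 3, the sketch below extracts MFCC features with librosa and fits a Gaussian mixture whose component ids serve as stand-in syllable labels for both languages. A production system would instead train a GMM-HMM against the transcripts and obtain the labels by forced alignment; the file names, sample rate, and component count here are assumptions.

    import librosa
    from sklearn.mixture import GaussianMixture

    # Extract 13-dim MFCC features (frames x coefficients) per language.
    audio1, sr = librosa.load("first_language_utt.wav", sr=16000)
    audio2, _ = librosa.load("second_language_utt.wav", sr=16000)
    mfcc1 = librosa.feature.mfcc(y=audio1, sr=sr, n_mfcc=13).T
    mfcc2 = librosa.feature.mfcc(y=audio2, sr=sr, n_mfcc=13).T

    # Fit the mixture on first-language features only; its components play
    # the role of the aligned model's syllable labels in this toy setup.
    gmm = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(mfcc1)

    # The same fitted model labels both languages frame by frame, so the
    # two label sequences draw on one shared inventory of equal size.
    first_syllable_labels = gmm.predict(mfcc1)
    second_syllable_labels = gmm.predict(mfcc2)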
4. The method of claim 1, wherein processing the second language audio and the second syllable label sequence using a pre-trained first language recognition model to obtain a connectionist temporal classification (CTC) Viterbi sequence comprises:
de-duplicating the second syllable label sequence to obtain a de-duplicated syllable label sequence;
removing silence frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence;
inserting blanks into the valid syllable label sequence to obtain a blank-inserted syllable label sequence;
and inputting the second language audio and the blank-inserted syllable label sequence into the first language recognition model to obtain the CTC Viterbi sequence.
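The three preprocessing steps of claim 4 map directly onto a few lines of Python. In the sketch below the silence label id and the blank id are assumptions, and the "space" of the machine translation is the CTC blank symbol:

    from itertools import groupby

    SIL, BLANK = 0, -1   # assumed ids for the silence label and the CTC blank

    def prepare_ctc_labels(frame_labels):
        # De-duplicate: collapse runs of repeated frame labels (step 1).
        deduped = [lab for lab, _ in groupby(frame_labels)]
        # Remove silence to keep only valid syllable labels (step 2).
        valid = [lab for lab in deduped if lab != SIL]
        # Insert a CTC blank between labels and at both ends (step 3).
        blanked = [BLANK]
        for lab in valid:
            blanked += [lab, BLANK]
        return valid, blanked

    valid, blanked = prepare_ctc_labels([0, 3, 3, 7, 7, 0, 0, 9])
    print(valid)    # [3, 7, 9]
    print(blanked)  # [-1, 3, -1, 7, -1, 9, -1]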
5. The method of claim 4, wherein determining a CTC Viterbi score for each second syllable label in the second syllable label sequence based on the CTC Viterbi sequence comprises:
for each valid syllable label in the valid syllable label sequence, determining a CTC Viterbi score for the valid syllable label based on its position in the CTC Viterbi sequence.
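Claim 5 does not pin down the scoring formula. One plausible reading, sketched below under that assumption, averages the Viterbi path's frame log-probabilities over the frames aligned to each valid label (repeated occurrences of a label are merged here for brevity):

    import numpy as np

    def per_label_scores(viterbi_path, frame_logprobs, valid_labels):
        # Average the path's frame log-probabilities over the frames
        # assigned to each valid syllable label.
        path = np.asarray(viterbi_path)
        logp = np.asarray(frame_logprobs)
        return [float(logp[path == lab].mean()) for lab in valid_labels]

    path = [3, 3, -1, 7, 7, 7, -1, 9]   # -1 marks the CTC blank
    logp = [-0.2, -0.1, -0.05, -2.3, -1.9, -2.5, -0.1, -0.3]
    print(per_label_scores(path, logp, [3, 7, 9]))  # label 7 scores worst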
6. The method of claim 4 or 5, wherein determining difference syllable labels from the second syllable label sequence based on the determined CTC Viterbi scores comprises:
determining, as difference syllable labels, the valid syllable labels in the valid syllable label sequence whose CTC Viterbi scores are smaller than a preset threshold.
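Selecting the difference labels is then a single comparison against the preset threshold; the threshold value below is an arbitrary assumption for the example:

    def difference_labels(valid_labels, scores, threshold=-1.0):
        # Keep the valid labels whose CTC Viterbi score falls below the
        # threshold: the second-language sounds the first-language model
        # covers poorly.
        return [lab for lab, s in zip(valid_labels, scores) if s < threshold]

    print(difference_labels([3, 7, 9], [-0.15, -2.23, -0.3]))  # -> [7]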
7. The method of claim 1, wherein the first language recognition model is trained by:
extracting filter bank features of the first language audio;
and training a deep neural network based on the filter bank features and the first syllable label sequence to obtain the first language recognition model, wherein the number of nodes in the output layer of the first language recognition model is equal to the number of labels in the first syllable label sequence.
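A deliberately small PyTorch stand-in for claim 7 is sketched below; the layer sizes, label count, and feature dimension are assumptions, and the extra output node for the CTC blank is an implementation detail the claims do not spell out. The nn.CTCLoss criterion also illustrates the CTC training criterion of claim 9.

    import torch
    import torch.nn as nn

    NUM_FIRST_LABELS = 1000   # assumed size of the first-language label set
    FBANK_DIM = 80            # assumed filter bank feature dimension

    # One output node per first-language syllable label, plus one node
    # for the CTC blank (assumed here; not stated in the claims).
    model = nn.Sequential(
        nn.Linear(FBANK_DIM, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, NUM_FIRST_LABELS + 1),
    )
    ctc_loss = nn.CTCLoss(blank=NUM_FIRST_LABELS)

    # One toy batch: 2 utterances, 50 frames of fbank features each.
    feats = torch.randn(50, 2, FBANK_DIM)                  # (T, N, F)
    log_probs = model(feats).log_softmax(dim=-1)           # (T, N, C)
    targets = torch.randint(0, NUM_FIRST_LABELS, (2, 12))  # syllable ids
    input_lens = torch.full((2,), 50, dtype=torch.long)
    target_lens = torch.full((2,), 12, dtype=torch.long)
    loss = ctc_loss(log_probs, targets, input_lens, target_lens)
    loss.backward()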
8. The method of claim 7, wherein performing hybrid training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain the hybrid language recognition model comprises:
performing hybrid training on a deep neural network based on the filter bank features, the first syllable label sequence and the difference syllable labels to obtain the hybrid language recognition model, wherein the number of nodes in the output layer of the hybrid language recognition model is equal to the sum of the number of labels in the first syllable label sequence and the number of difference syllable labels.
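Under the same assumptions as the previous sketch, only the size of the output layer changes for the hybrid model of claim 8: it grows to cover the union of the first-language labels and the retained difference labels (the difference-label count below is an invented example value).

    import torch.nn as nn

    NUM_FIRST_LABELS = 1000   # assumed first-language label count
    NUM_DIFF_LABELS = 87      # invented count of retained difference labels
    FBANK_DIM = 80            # assumed fbank feature dimension

    # Output nodes = first-language labels + difference labels, with one
    # extra assumed node for the CTC blank; the hidden layers are reused
    # unchanged from the first-language model's architecture.
    hybrid_model = nn.Sequential(
        nn.Linear(FBANK_DIM, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, NUM_FIRST_LABELS + NUM_DIFF_LABELS + 1),
    )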
9. The method of claim 7 or 8, wherein training the first language recognition model or the hybrid language recognition model optimizes the deep neural network using a connectionist temporal classification (CTC) training criterion.
10. An apparatus for training a hybrid language recognition model, comprising:
a generating unit configured to generate a first syllable label sequence of a first language audio and a second syllable label sequence of a second language audio;
a processing unit configured to process the second language audio and the second syllable label sequence using a pre-trained first language recognition model to obtain a connectionist temporal classification (CTC) Viterbi sequence, wherein the first language recognition model is trained based on the first syllable label sequence;
a first determining unit configured to determine a CTC Viterbi score for each second syllable label in the second syllable label sequence based on the CTC Viterbi sequence;
a second determining unit configured to determine difference syllable labels from the second syllable label sequence based on the determined CTC Viterbi scores;
a training unit configured to perform hybrid training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model.
11. The apparatus of claim 10, wherein the generating unit is further configured to:
extracting Mel-frequency cepstral coefficient (MFCC) features of the first language audio;
and training a Gaussian mixture model based on the MFCC features and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and the first syllable label sequence.
12. The apparatus of claim 11, wherein the generating unit is further configured to:
inputting the second language audio into the aligned Gaussian mixture model to obtain the second syllable label sequence, wherein the number of labels in the second syllable label sequence is equal to the number of labels in the first syllable label sequence.
13. The apparatus of claim 10, wherein the processing unit is further configured to:
de-duplicating the second syllable label sequence to obtain a de-duplicated syllable label sequence;
removing silence frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence;
inserting blanks into the valid syllable label sequence to obtain a blank-inserted syllable label sequence;
and inputting the second language audio and the blank-inserted syllable label sequence into the first language recognition model to obtain the CTC Viterbi sequence.
14. The apparatus of claim 13, wherein the first determining unit is further configured to:
for each valid syllable label in the valid syllable label sequence, determining a CTC Viterbi score for the valid syllable label based on its position in the CTC Viterbi sequence.
15. The apparatus of claim 13 or 14, wherein the second determining unit is further configured to:
determining, as difference syllable labels, the valid syllable labels in the valid syllable label sequence whose CTC Viterbi scores are smaller than a preset threshold.
16. The apparatus of claim 10, wherein the first language recognition model is trained by:
extracting filter bank features of the first language audio;
training a deep neural network based on the filter bank features and the first syllable label sequence to obtain the first language recognition model, wherein the number of nodes in the output layer of the first language recognition model is equal to the number of labels in the first syllable label sequence.
17. The apparatus of claim 16, wherein the training unit is further configured to:
performing hybrid training on a deep neural network based on the filter bank features, the first syllable label sequence and the difference syllable labels to obtain the hybrid language recognition model, wherein the number of nodes in the output layer of the hybrid language recognition model is equal to the sum of the number of labels in the first syllable label sequence and the number of difference syllable labels.
18. The apparatus of claim 16 or 17, wherein training the first language recognition model or the hybrid language recognition model optimizes the deep neural network using a connectionist temporal classification (CTC) training criterion.
19. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
20. A computer-readable medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-9.
CN201911075308.1A 2019-11-06 2019-11-06 Method and apparatus for training hybrid language recognition models Active CN110808035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911075308.1A CN110808035B (en) 2019-11-06 2019-11-06 Method and apparatus for training hybrid language recognition models

Publications (2)

Publication Number Publication Date
CN110808035A CN110808035A (en) 2020-02-18
CN110808035B true CN110808035B (en) 2021-11-26

Family

ID=69501546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911075308.1A Active CN110808035B (en) 2019-11-06 2019-11-06 Method and apparatus for training hybrid language recognition models

Country Status (1)

Country Link
CN (1) CN110808035B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5983180A (en) * 1997-10-23 1999-11-09 Softsound Limited Recognition of sequential data using finite state sequence models organized in a tree structure
CN1864204A (en) * 2002-09-06 2006-11-15 语音信号技术有限公司 Methods, systems and programming for performing speech recognition
CN101393740A (en) * 2008-10-31 2009-03-25 清华大学 Computer speech recognition modeling method for Mandarin with multiple dialect backgrounds
CN107195295A (en) * 2017-05-04 2017-09-22 百度在线网络技术(北京)有限公司 Audio recognition method and device based on Chinese and English mixing dictionary
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Towards Bootstrapping Acoustic Models for Resource Poor Indian Languages; Prabhat Pandey; 2017 Twenty-third National Conference on Communications (NCC); 2017-12-31; full text *
Research Progress and Prospects of Speech Recognition Technology (语音识别技术的研究进展与展望); Wang Haikun et al.; Telecommunications Science (电信科学); 2018-02-20 (No. 02); full text *

Also Published As

Publication number Publication date
CN110808035A (en) 2020-02-18

Similar Documents

Publication Publication Date Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN110033760B (en) Modeling method, device and equipment for speech recognition
CN108989882B (en) Method and apparatus for outputting music pieces in video
CN110046254B (en) Method and apparatus for generating a model
CN111428010B (en) Man-machine intelligent question-answering method and device
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN107437417B (en) Voice data enhancement method and device based on recurrent neural network voice recognition
CN112259089B (en) Speech recognition method and device
CN109241286B (en) Method and device for generating text
CN109582825B (en) Method and apparatus for generating information
US11120802B2 (en) Diarization driven by the ASR based segmentation
CN110136715B (en) Speech recognition method and device
CN113434683A (en) Text classification method, device, medium and electronic equipment
CN111508478B (en) Speech recognition method and device
US10468031B2 (en) Diarization driven by meta-information identified in discussion content
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN110930975A (en) Method and apparatus for outputting information
CN113870863B (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN111078849A (en) Method and apparatus for outputting information
CN113096687A (en) Audio and video processing method and device, computer equipment and storage medium
CN112633004A (en) Text punctuation deletion method and device, electronic equipment and storage medium
CN112017685A (en) Voice generation method, device, equipment and computer readable medium
CN114970470B (en) Method and device for processing file information, electronic equipment and computer readable medium
CN116450943A (en) Artificial intelligence-based speaking recommendation method, device, equipment and storage medium
CN110808035B (en) Method and apparatus for training hybrid language recognition models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant