CN110675865A - Method and apparatus for training hybrid language recognition models


Info

Publication number: CN110675865A
Authority: CN (China)
Prior art keywords: sequence, syllable, syllable label, language, label
Legal status: Granted; Active
Application number: CN201911075088.2A
Other languages: Chinese (zh)
Other versions: CN110675865B (en)
Inventor: 袁胜龙
Current Assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201911075088.2A
Publication of CN110675865A
Application granted
Publication of CN110675865B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L2015/027 Syllables being the recognition units

Abstract

The embodiment of the application discloses a method and a device for training a hybrid language recognition model. One embodiment of the method comprises: generating a first syllable tag sequence of the first language audio and a second syllable tag sequence of the second language audio; inputting the second language audio to a pre-trained first language identification model to obtain a connection time sequence classification peak sequence; calculating the connection time sequence classification peak accuracy of each second syllable label in the second syllable label sequence based on the second syllable label sequence and the connection time sequence classification peak sequence; determining a difference syllable label from the second syllable label sequence based on the calculated connection timing classification peak accuracy; and performing mixed training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a mixed language identification model. This embodiment enables a single model to support the recognition of multiple languages.

Description

Method and apparatus for training hybrid language recognition models
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for training a hybrid language recognition model.
Background
With the development of speech recognition technology, recognition performance has reached a practical level; for example, many input methods on mobile phones offer a voice interaction function. In practical applications, speech recognition is needed in dialect scenarios as well as in Mandarin scenarios. At present, many voice interaction products support dialect speech recognition, for example, the speech recognition options in a mobile phone input method, where a user can select the corresponding dialect as needed, or smart televisions, smart refrigerators and the like that are customized for a specific dialect.
In the related art, a Mandarin Chinese recognition model is usually used for speech recognition of Mandarin, while a corresponding dialect recognition model is used for each dialect. When a user switches languages, the user has to select the corresponding speech recognition model back and forth, which is tedious to operate. Moreover, as more and more dialects are supported, more and more dialect recognition models need to be trained, so the workload of model training keeps growing.
Disclosure of Invention
The embodiment of the application provides a method and a device for training a hybrid language recognition model.
In a first aspect, an embodiment of the present application provides a method for training a hybrid language recognition model, including: generating a first syllable tag sequence of the first language audio and a second syllable tag sequence of the second language audio; inputting the second language audio to a pre-trained first language identification model to obtain a connection time sequence classification peak sequence, wherein the first language identification model is obtained by training based on a first syllable label sequence; calculating the connection time sequence classification peak accuracy of each second syllable label in the second syllable label sequence based on the second syllable label sequence and the connection time sequence classification peak sequence; determining a difference syllable label from the second syllable label sequence based on the calculated connection timing classification peak accuracy; and performing mixed training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a mixed language identification model.
In some embodiments, generating a first syllable tag sequence for the first language audio comprises: extracting mel frequency cepstrum coefficient characteristics of the first language audio; and training the Gaussian mixture model based on the Mel frequency cepstrum coefficient characteristics and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and a first syllable label sequence.
In some embodiments, generating a second syllable tag sequence for the second language audio comprises: and inputting the second language audio into the aligned Gaussian mixture model to obtain a second syllable label sequence, wherein the number of label types of the second syllable label sequence is equal to that of the first syllable label sequence.
In some embodiments, calculating the connection timing classification peak accuracy for each second syllable label in the second syllable label sequence based on the second syllable label sequence and the connection timing classification peak sequence comprises: de-duplicating the second syllable label sequence to obtain a de-duplicated syllable label sequence; removing silence frames from the de-duplicated syllable label sequence to obtain an effective syllable label sequence; and comparing the effective syllable label sequence with the connection timing classification peak sequence to obtain the connection timing classification peak accuracy of each second syllable label in the second syllable label sequence.
In some embodiments, comparing the effective syllable label sequence with the connection timing classification peak sequence to obtain the connection timing classification peak accuracy of each second syllable label in the second syllable label sequence comprises: for each effective syllable label in the effective syllable label sequence, searching the connection timing classification peak corresponding to the position of the effective syllable label from the connection timing classification peak sequence; counting the number of searched connection timing classification peaks that are the same as the corresponding effective syllable label; and calculating the ratio of the counted number to the total number of the effective syllable labels to obtain the connection timing classification peak accuracy of the effective syllable label.
In some embodiments, determining a difference syllable label from the second syllable label sequence based on the calculated connection timing classification peak accuracy comprises: and determining the effective syllable label of which the correct rate of the connection time sequence classification peak in the effective syllable label sequence is less than a preset threshold value as a difference syllable label.
In some embodiments, the first language identification model is trained by: extracting the filter bank coefficient characteristics of the first language audio; and training the deep neural network based on the filter bank coefficient characteristics and the first syllable label sequence to obtain a first language identification model, wherein the node number of an output layer of the first language identification model is equal to the label type number of the first syllable label sequence.
In some embodiments, the hybrid training of the deep neural network based on the first syllable label sequence and the difference syllable labels results in a hybrid language recognition model, comprising: and performing mixed training on the deep neural network based on the filter bank coefficient characteristics, the first syllable label sequence and the difference syllable labels to obtain a mixed language identification model, wherein the node number of an output layer of the mixed language identification model is equal to the sum of the label type number of the first syllable label sequence and the label type number of the difference syllable labels.
In some embodiments, training the first or hybrid language recognition model optimizes the deep neural network using training criteria based on connection timing classification.
In a second aspect, an embodiment of the present application provides an apparatus for training a hybrid language recognition model, including: a generating unit configured to generate a first syllable tag sequence of a first language audio and a second syllable tag sequence of a second language audio; an input unit configured to input a second language audio to a pre-trained first language identification model, resulting in a connection timing classification peak sequence, wherein the first language identification model is trained based on a first syllable label sequence; a calculation unit configured to calculate a connection timing classification peak accuracy of each of the second syllable labels in the second syllable label sequence based on the second syllable label sequence and the connection timing classification peak sequence; a determining unit configured to determine a difference syllable label from the second syllable label sequence based on the calculated connection timing classification peak accuracy; and the training unit is configured to perform mixed training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a mixed language recognition model.
In some embodiments, the generating unit is further configured to: extracting mel frequency cepstrum coefficient characteristics of the first language audio; and training the Gaussian mixture model based on the Mel frequency cepstrum coefficient characteristics and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and a first syllable label sequence.
In some embodiments, the generating unit is further configured to: and inputting the second language audio into the aligned Gaussian mixture model to obtain a second syllable label sequence, wherein the number of label types of the second syllable label sequence is equal to that of the first syllable label sequence.
In some embodiments, the computing unit comprises: a de-duplication subunit configured to de-duplicate the second syllable label sequence to obtain a de-duplicated syllable label sequence; a silence-removal subunit configured to remove silence frames from the de-duplicated syllable label sequence to obtain an effective syllable label sequence; and a comparison subunit configured to compare the effective syllable label sequence with the connection timing classification peak sequence to obtain the connection timing classification peak accuracy of each second syllable label in the second syllable label sequence.
In some embodiments, the comparing subunit is further configured to: for each effective syllable label in the effective syllable label sequence, search the connection timing classification peak corresponding to the position of the effective syllable label from the connection timing classification peak sequence; count the number of searched connection timing classification peaks that are the same as the corresponding effective syllable label; and calculate the ratio of the counted number to the total number of the effective syllable labels to obtain the connection timing classification peak accuracy of the effective syllable label.
In some embodiments, the determining unit is further configured to: and determining the effective syllable label of which the correct rate of the connection time sequence classification peak in the effective syllable label sequence is less than a preset threshold value as a difference syllable label.
In some embodiments, the first language identification model is trained by: extracting the filter bank coefficient characteristics of the first language audio; and training the deep neural network based on the filter bank coefficient characteristics and the first syllable label sequence to obtain a first language identification model, wherein the node number of an output layer of the first language identification model is equal to the label type number of the first syllable label sequence.
In some embodiments, the training unit is further configured to: and performing mixed training on the deep neural network based on the filter bank coefficient characteristics, the first syllable label sequence and the difference syllable labels to obtain a mixed language identification model, wherein the node number of an output layer of the mixed language identification model is equal to the sum of the label type number of the first syllable label sequence and the label type number of the difference syllable labels.
In some embodiments, training the first or hybrid language recognition model optimizes the deep neural network using training criteria based on connection timing classification.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
According to the method and the device for training the mixed language recognition model, firstly, a first syllable label sequence of a first language audio and a second syllable label sequence of a second language audio are generated; then, inputting the second language audio to a first language identification model pre-trained based on the first syllable label sequence to obtain a connection time sequence classification peak sequence; then based on the second syllable label sequence and the connection time sequence classification peak sequence, calculating the connection time sequence classification peak accuracy of each second syllable label in the second syllable label sequence; then determining a difference syllable label from the second syllable label sequence based on the calculated connection timing sequence classification peak accuracy; and finally, performing mixed training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a mixed language recognition model. And determining a difference syllable label from the second syllable label sequence based on the connection time sequence classification peak accuracy, and training a mixed language recognition model based on the first syllable label sequence and the difference syllable label, thereby reducing the model training workload. In addition, the mixed language recognition model is trained based on the syllable labels of multiple languages, and the recognition that the same model supports multiple languages is realized. In addition, the user does not need to switch among a plurality of models, and the user operation is simplified.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture to which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for training a hybrid language recognition model according to the present application;
FIG. 3 is a schematic diagram of the structure of a hybrid language recognition model;
FIG. 4 is a flow diagram of yet another embodiment of a method for training a hybrid language recognition model according to the present application;
FIG. 5 is a schematic diagram illustrating an architecture of one embodiment of an apparatus for training a hybrid language recognition model according to the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the present method for training a hybrid language recognition model or an apparatus for training a hybrid language recognition model may be applied.
As shown in fig. 1, a system architecture 100 may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as a voice recognition application, etc., may be installed on the terminal device 101.
The terminal apparatus 101 may be hardware or software. When the terminal apparatus 101 is hardware, it may be any of various electronic devices, including but not limited to smart phones, tablets, smart speakers, smart televisions, smart refrigerators, and the like. When the terminal apparatus 101 is software, it can be installed in the electronic devices described above, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. It is not particularly limited herein.
The server 103 may provide various services. For example, the server 103 may analyze and process data such as the first language audio and the second language audio acquired from the terminal apparatus 101 and generate a processing result (e.g., a hybrid language identification model).
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for training the hybrid language recognition model provided in the embodiment of the present application is generally performed by the server 103, and accordingly, the apparatus for training the hybrid language recognition model is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for training a hybrid language recognition model in accordance with the present application is illustrated. The method for training the hybrid language recognition model comprises the following steps:
step 201, a first syllable label sequence of a first language audio and a second syllable label sequence of a second language audio are generated.
In this embodiment, an executing agent (e.g., server 103 shown in fig. 1) of a method for training a hybrid speech recognition model may generate a first syllable tag sequence of a first language audio and a second syllable tag sequence of a second language audio.
Typically, the first language and the second language may be different classes of languages, for example the first language may be Mandarin and the second language may be one or more dialects. Further, the first language audio and the second language audio correspond to the same text. That is, the first language audio is the same content as the second language audio. The execution body may perform syllable segmentation on the first language audio and the second language audio respectively to generate a first syllable label sequence and a second syllable label sequence. The first language audio and the second language audio may be segmented by the same syllable segmentation method after being aligned in frame level, so that the first syllable label sequence and the second syllable label sequence have the same number of label types. For example, if the first syllable tag sequence includes 2000 tags, then the second syllable tag sequence also includes 2000 tags.
Step 202, inputting the second language audio to the pre-trained first language identification model to obtain a connection time sequence classification peak sequence.
In this embodiment, the executing entity may input the second language audio to the first language identification model trained in advance to obtain the connection timing classification peak sequence. Generally, after the second language audio is input into the first language identification model, a forward pass is performed to obtain the connection timing classification peaks corresponding to each sentence of the second language audio. Here, a connection timing classification peak is a CTC (Connectionist Temporal Classification) spike, i.e., a frame at which the model output concentrates its probability on a non-blank label.
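The description does not spell out how the peak sequence is extracted from the model outputs. Below is a minimal sketch under the common assumption that a CTC spike is any frame whose highest-scoring label is not the blank; the blank index of 0 and the function name are illustrative, not taken from the patent:

```python
import numpy as np

def ctc_spike_sequence(posteriors: np.ndarray, blank: int = 0):
    """Extract a CTC spike (peak) sequence from frame-wise posteriors.

    posteriors: (num_frames, num_labels) softmax outputs of the first
    language identification model for one second-language utterance.
    Returns the label index of every frame whose argmax is not blank,
    i.e. the frames where the CTC output spikes on a syllable label.
    """
    frame_labels = posteriors.argmax(axis=1)
    return [int(k) for k in frame_labels if k != blank]
```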
In this embodiment, the first language identification model may be trained based on the first syllable label sequence, and is used for identifying the first language. Generally, the execution body may first extract a Filter Bank coefficient (Filter Bank) feature of the first language audio; and training the deep neural network based on the filter bank coefficient characteristics and the first syllable label sequence to obtain a first language identification model. Wherein the number of nodes of the output layer of the first language identification model may be equal to the number of tag classes of the first syllable tag sequence. For example, if the first syllable label sequence includes 2000 kinds of labels, the output layer of the first language identification model may include 2000 nodes. The deep neural network used to train the first language recognition model may be, for example, a LSTM (Long Short-Term Memory) network based on CTC criteria.
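A simplified sketch of the kind of CTC-trained LSTM acoustic model described above is given below (PyTorch; the 80-dimensional FBank features, layer sizes, and the 2000-label output are illustrative assumptions rather than values prescribed by the patent):

```python
import torch
import torch.nn as nn

class FirstLanguageModel(nn.Module):
    """LSTM acoustic model: one output node per first-language syllable
    label type (e.g. 2000) plus one CTC blank node."""
    def __init__(self, feat_dim=80, hidden=512, num_syllables=2000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, num_syllables + 1)

    def forward(self, feats):                   # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.out(h).log_softmax(dim=-1)  # (B, T, num_syllables + 1)

model = FirstLanguageModel()
ctc_loss = nn.CTCLoss(blank=2000)               # blank = last output node

feats = torch.randn(4, 200, 80)                 # toy batch of FBank features
targets = torch.randint(0, 2000, (4, 30))       # toy syllable label ids
in_lens = torch.full((4,), 200, dtype=torch.long)
tgt_lens = torch.full((4,), 30, dtype=torch.long)

log_probs = model(feats).permute(1, 0, 2)       # CTCLoss expects (T, B, C)
loss = ctc_loss(log_probs, targets, in_lens, tgt_lens)
loss.backward()
```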
In some optional implementations of the embodiment, the training of the first language recognition model may optimize the deep neural network using a training criterion based on connection timing classification. The training criterion based on the connection timing classification can be shown as the following formula:
$$\frac{\partial \ln p(z \mid x)}{\partial a_k^t} = \frac{1}{y_k^t \, p(z \mid x)} \sum_{s \in \mathrm{label}(z,k)} \alpha_t(s)\,\beta_t(s) - y_k^t$$

wherein a_k is the (pre-activation) output of the network for label k, y_k is the softmax score of label k at a certain time, s is the state, x is the input feature sequence, z is the CTC path at time t, Σ_{s∈label(z,k)} α_t(s)β_t(s) is the score belonging to label k in the CTC path at a certain time (obtained by multiplying the CTC forward score α(s) and backward score β(s)), and p(z|x) is the total score of the paths traversed by the CTC at a certain time.
And step 203, calculating the connection time sequence classification peak accuracy of each second syllable label in the second syllable label sequence based on the second syllable label sequence and the connection time sequence classification peak sequence.
In this embodiment, the execution body may calculate the connection timing classification peak accuracy of each second syllable label in the second syllable label sequence based on the second syllable label sequence and the connection timing classification peak sequence. Specifically, for each second syllable label in the second syllable label sequence, the execution body may search the connection timing classification peaks corresponding to the second syllable label from the connection timing classification peak sequence, and compare the searched connection timing classification peaks with the second syllable label one by one. Each time a searched connection timing classification peak is the same as the second syllable label, the correct count of that second syllable label is increased by one; when a searched connection timing classification peak differs from the second syllable label, the correct count is left unchanged. After the comparison is finished, the ratio of the correct count to the total number of searched connection timing classification peaks is calculated, which is the connection timing classification peak accuracy of the second syllable label.
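A sketch of this per-label bookkeeping, under the simplifying assumption that the reference second syllable labels and the searched connection timing classification peaks have already been paired position by position; the function name and data layout are hypothetical:

```python
from collections import defaultdict

def peak_accuracy_per_label(label_seqs, peak_seqs):
    """Accumulate the CTC peak accuracy of each syllable label over a corpus.

    label_seqs / peak_seqs: per-utterance lists of equal length in which
    position i holds the reference second syllable label and the CTC peak
    the first-language model produced at that position.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for labels, peaks in zip(label_seqs, peak_seqs):
        for lab, peak in zip(labels, peaks):
            total[lab] += 1
            correct[lab] += int(lab == peak)
    return {lab: correct[lab] / total[lab] for lab in total}
```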
Step 204, determining a difference syllable label from the second syllable label sequence based on the calculated connection timing classification peak accuracy.
In this embodiment, the execution body may determine the different syllable label from the second syllable label sequence based on the calculated connection timing classification peak accuracy. For example, the execution body may sort all the second syllable tags in the second syllable tag sequence according to the sequence from high to low of the connection timing classification peak accuracy, and select a preset number (e.g., 400) of second syllable tags from the side with low connection timing classification peak accuracy, where the selected second syllable tags are the difference syllable tags.
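A sketch of this selection rule; the figure of 400 labels is only the example count used in the text, and the function name is hypothetical:

```python
def select_difference_labels(accuracy, num_difference=400):
    """Pick the syllable labels whose CTC peak accuracy is lowest."""
    ranked = sorted(accuracy.items(), key=lambda kv: kv[1])  # low accuracy first
    return [lab for lab, _ in ranked[:num_difference]]
```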
It should be understood that a difference syllable label corresponds to a pronunciation that differs significantly from the first language while the corresponding text content is the same as in the first language, so after the difference syllable labels are determined, the execution body will usually add a corresponding alternative (polyphonic) pronunciation for the corresponding text in the decoding dictionary.
And step 205, performing hybrid training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language identification model.
In this embodiment, the executing entity may perform hybrid training on the deep neural network based on the first syllable label sequence and the different syllable labels to obtain a hybrid language identification model. The hybrid language identification model can identify the first language and the second language. That is, the hybrid language recognition model enables recognition of multiple languages supported by the same model.
In general, the execution body may perform hybrid training on the deep neural network based on the filter bank coefficient characteristics, the first syllable tag sequence, and the difference syllable tags to obtain a hybrid language identification model, where the number of nodes of an output layer of the hybrid language identification model may be equal to the sum of the number of tag types of the first syllable tag sequence and the number of tag types of the difference syllable tags. For example, if the first syllable label sequence includes 2000 kinds of labels and the difference syllable label includes 400 kinds of labels, the output layer of the hybrid language identification model may include 2400 nodes. And when the hybrid language recognition model is trained, a training criterion based on connection time sequence classification can be adopted to optimize the deep neural network. It should be noted that the step of training the hybrid language identification model is similar to the step of training the first language identification model, and is not repeated here.
For ease of understanding, FIG. 3 shows a schematic structural diagram of a hybrid language recognition model. As shown in fig. 3, the hybrid language recognition model may include an input layer, a hidden layer, and an output layer. Taking the mixed recognition of Mandarin, dialect A and dialect B as an example, each of Mandarin, dialect A and dialect B has 2000 syllable labels. For dialect A, 400 of its 2000 syllable labels differ from Mandarin, so these 400 difference syllable labels of dialect A serve as independent modeling units, while its other 1600 syllable labels share modeling units with Mandarin. Similarly, for dialect B, 600 of its 2000 syllable labels differ significantly from Mandarin, so these 600 difference syllable labels of dialect B serve as independent modeling units, while its other 1400 syllable labels share modeling units with Mandarin. Thus, a hybrid language recognition model trained based on the syllable labels of Mandarin, the difference syllable labels of dialect A, and the difference syllable labels of dialect B has an output layer of 3000 nodes. Among them, the 2000 kinds of syllable labels of Mandarin correspond to 2000 nodes. The 400 difference syllable labels of dialect A correspond to 400 independent nodes, and its other 1600 syllable labels share nodes with Mandarin. Similarly, the 600 difference syllable labels of dialect B correspond to 600 independent nodes, and its other 1400 syllable labels share nodes with Mandarin.
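The output-layer arithmetic of the Mandarin / dialect A / dialect B example above can be checked with a one-line helper (the helper name is invented for illustration; the numbers come from the example):

```python
def hybrid_output_nodes(num_first_language_labels, difference_label_counts):
    """Output layer size = first-language label types plus the difference
    labels contributed by each additional dialect."""
    return num_first_language_labels + sum(difference_label_counts)

# Mandarin (2000) + dialect A differences (400) + dialect B differences (600)
assert hybrid_output_nodes(2000, [400, 600]) == 3000
```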
The method for training the hybrid language identification model provided by the embodiment of the application comprises the steps of firstly generating a first syllable label sequence of a first language audio and a second syllable label sequence of a second language audio; then, inputting the second language audio to a first language identification model pre-trained based on the first syllable label sequence to obtain a connection time sequence classification peak sequence; then based on the second syllable label sequence and the connection time sequence classification peak sequence, calculating the connection time sequence classification peak accuracy of each second syllable label in the second syllable label sequence; then determining a difference syllable label from the second syllable label sequence based on the calculated connection timing sequence classification peak accuracy; and finally, performing mixed training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a mixed language recognition model. Determining the difference syllable labels from the second syllable label sequence based on the connection timing classification peak accuracy, and training the mixed language recognition model based on the first syllable label sequence and the difference syllable labels, reduces the model training workload. In addition, the mixed language recognition model is trained based on the syllable labels of multiple languages, so that a single model supports the recognition of multiple languages. Moreover, the user does not need to switch among a plurality of models, which simplifies user operation.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for training a hybrid language recognition model in accordance with the present application is illustrated. The method for training the hybrid language recognition model comprises the following steps:
step 401, extracting mel-frequency cepstrum coefficient features of the first language audio.
In this embodiment, an executing body (e.g., server 103 shown in fig. 1) of a method for training a hybrid language recognition model may extract Mel Frequency Cepstral Coefficient (MFCC) features of the first language audio. The first language may be, for example, Mandarin Chinese.
Step 402, training a Gaussian mixture model based on the Mel frequency cepstrum coefficient characteristics and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and a first syllable label sequence.
In this embodiment, the executing body may train a Gaussian Mixture Model (GMM) based on the Mel frequency cepstral coefficient features and the text corresponding to the first language audio, so as to obtain an aligned Gaussian mixture model and a first syllable label sequence. Generally, a Gaussian mixture model can be trained based on the Mel frequency cepstral coefficient features and the text corresponding to the first language audio and used to align the audio at the frame level, so the model is called an aligned Gaussian mixture model. The first syllable label sequence can then be obtained by passing the Mel frequency cepstral coefficient features and the text corresponding to the first language audio through the aligned Gaussian mixture model.
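A sketch of the feature-extraction side of this step; librosa and scikit-learn are used here only as stand-ins, and a single GMM fitted to pooled frames is merely a placeholder for the per-state HMM-GMM forced aligner that frame-level alignment actually requires (file names and parameters are hypothetical):

```python
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """MFCC features of one first-language utterance, shape (num_frames, n_mfcc)."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# feats = extract_mfcc("mandarin_utt_001.wav")       # hypothetical file
# gmm = GaussianMixture(n_components=64).fit(feats)  # placeholder, not a forced aligner
```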
Step 403, inputting the second language audio into the aligned gaussian mixture model to obtain a second syllable label sequence.
In this embodiment, the execution body may input the second language audio to the aligned gaussian mixture model to obtain the second syllable label sequence. Here, aligning the second language audio using the aligned gaussian mixture model can ensure that the syllable labels of the first language audio and the second language audio are strictly consistent. Thus, the number of tag types for the second syllable tag sequence obtained using the aligned Gaussian mixture model will be equal to the number of tag types for the first syllable tag sequence. Wherein the second language is different from the first language category, for example the first language is mandarin and the second language is one or more dialects.
Step 404, inputting the second language audio to the pre-trained first language identification model to obtain a connection timing sequence classification peak sequence.
In this embodiment, the specific operation of step 404 has been described in detail in step 202 in the embodiment shown in fig. 2, and is not described herein again.
Step 405, de-duplicating the second syllable label sequence to obtain a de-duplicated syllable label sequence.
In this embodiment, the execution body may de-duplicate the second syllable label sequence to obtain a de-duplicated syllable label sequence. For example, if the second syllable label sequence is "0000 a a a 00 b b b b b 000 c c c 00 d d d 0 e e 000 f", the de-duplicated syllable label sequence obtained by the de-duplication process may be "0 a 0 b 0 c 0 d 0 e 0 f", where "0" is a silence frame.
Step 406, removing silence frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence.
In this embodiment, the execution body may remove the silence frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence. For example, if the de-duplicated syllable label sequence is "0 a 0 b 0 c 0 d 0 e 0 f", the valid syllable label sequence obtained after removing the silence frames may be "a b c d e f", where "0" is a silence frame.
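Steps 405 and 406 can be sketched directly on the example sequence above; the "0" silence marker comes from the text, while the helper names are invented for illustration:

```python
from itertools import groupby

SILENCE = "0"  # silence-frame marker used in the example

def deduplicate(frame_labels):
    """Step 405: collapse runs of identical frame labels."""
    return [lab for lab, _ in groupby(frame_labels)]

def remove_silence(labels):
    """Step 406: drop silence frames, keeping only valid syllable labels."""
    return [lab for lab in labels if lab != SILENCE]

frames = "0000aaa00bbbbb000ccc00ddd0ee000f"
deduped = deduplicate(frames)    # ['0','a','0','b','0','c','0','d','0','e','0','f']
valid = remove_silence(deduped)  # ['a','b','c','d','e','f']
```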
Step 407, for each valid syllable label in the valid syllable label sequence, searching the connection time sequence classification peak corresponding to the position of the valid syllable label from the connection time sequence classification peak sequence.
In this embodiment, the execution body may compare the valid syllable label sequence with the connection timing classification peak sequence to obtain the connection timing classification peak accuracy of each valid syllable label in the valid syllable label sequence. Specifically, for each valid syllable label in the valid syllable label sequence, the execution body may search the connection timing classification peak corresponding to the position of the valid syllable label in the valid syllable label sequence from the connection timing classification peak sequence. Wherein the position of the effective syllable label in the effective syllable label sequence is the same as the position of the corresponding connection time sequence classification peak in the connection time sequence classification peak sequence.
Step 408, counting the number of searched connection timing classification peaks that are the same as the corresponding valid syllable labels.
In this embodiment, the execution main body may count how many of the searched connection timing classification peaks are the same as their corresponding valid syllable labels. This count is the correct number of connection timing classification peaks for the valid syllable labels.
Step 409, calculating the ratio of the counted number to the total number of the valid syllable labels to obtain the correct rate of the connection time sequence classification peak of the valid syllable labels.
In this embodiment, the execution body may calculate a ratio of the counted number to the total number of the valid syllable labels in the valid syllable label sequence to obtain the correct rate of the connected time sequence classification peak of the valid syllable labels.
Step 410, determining the valid syllable label with the correct rate of the connection time sequence classification peak in the valid syllable label sequence smaller than the preset threshold value as the difference syllable label.
In this embodiment, the execution body may determine a valid syllable label in the valid syllable label sequence whose connection timing classification peak accuracy is smaller than a preset threshold (e.g., 20%) as a difference syllable label. For example, the execution body may sort all valid syllable labels in the valid syllable label sequence from high to low by connection timing classification peak accuracy, compare the accuracy of each valid syllable label with the preset threshold one by one starting from the low-accuracy end, stop once a valid syllable label whose accuracy is not less than the preset threshold appears, and determine the valid syllable labels compared before that point as the difference syllable labels.
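A sketch of the threshold rule in step 410; the 20% value is the example threshold mentioned above, and the dictionary of per-label accuracies is assumed to come from a computation like the one sketched after step 203:

```python
ACCURACY_THRESHOLD = 0.20  # example threshold from the text

def difference_labels_by_threshold(accuracy, threshold=ACCURACY_THRESHOLD):
    """Step 410: a valid syllable label becomes a difference syllable label
    when its CTC peak accuracy falls below the preset threshold."""
    return [lab for lab, acc in accuracy.items() if acc < threshold]
```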
And 411, performing hybrid training on the deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language identification model.
In this embodiment, the specific operation of step 411 has been described in detail in step 205 in the embodiment shown in fig. 2, and is not described herein again.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for training the hybrid language recognition model in the present embodiment highlights the steps of generating the syllable labels and calculating the connection timing classification peak accuracy. Therefore, the scheme described in this embodiment can ensure that the first syllable label sequence of the first language audio and the second syllable label sequence of the second language audio are strictly consistent, by processing the second language with the aligned Gaussian mixture model trained on the first language data. In addition, after the second syllable label sequence is de-duplicated and its silence frames removed, the valid syllable labels in the valid syllable label sequence correspond one-to-one to the connection timing classification peaks in the connection timing classification peak sequence, which makes the comparison convenient.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for training a hybrid language recognition model, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 5, the apparatus 500 for training a hybrid language recognition model of the present embodiment may include: a generating unit 501, an input unit 502, a calculating unit 503, a determining unit 504 and a training unit 505. Wherein the generating unit 501 is configured to generate a first syllable tag sequence of a first language audio and a second syllable tag sequence of a second language audio; an input unit 502 configured to input a second language audio to a pre-trained first language identification model, resulting in a connection timing classification peak sequence, wherein the first language identification model is trained based on a first syllable label sequence; a calculating unit 503 configured to calculate a connection timing classification peak accuracy rate of each second syllable label in the second syllable label sequence based on the second syllable label sequence and the connection timing classification peak sequence; a determining unit 504 configured to determine a difference syllable label from the second syllable label sequence based on the calculated connection timing classification peak accuracy; a training unit 505 configured to perform hybrid training on the deep neural network based on the first syllable label sequence and the difference syllable labels, resulting in a hybrid language identification model.
In the present embodiment, in the apparatus 500 for training a hybrid language recognition model: the specific processing of the generating unit 501, the input unit 502, the calculating unit 503, the determining unit 504 and the training unit 505 and the technical effects thereof can refer to the related description of step 201 and step 205 in the corresponding embodiment of fig. 2, which is not repeated herein.
In some optional implementations of this embodiment, the generating unit 501 is further configured to: extracting mel frequency cepstrum coefficient characteristics of the first language audio; and training the Gaussian mixture model based on the Mel frequency cepstrum coefficient characteristics and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and a first syllable label sequence.
In some optional implementations of this embodiment, the generating unit 501 is further configured to: and inputting the second language audio into the aligned Gaussian mixture model to obtain a second syllable label sequence, wherein the number of label types of the second syllable label sequence is equal to that of the first syllable label sequence.
In some optional implementations of this embodiment, the calculating unit 503 includes: a de-duplication subunit (not shown) configured to de-duplicate the second syllable label sequence, resulting in a de-duplicated syllable label sequence; a silence-removal subunit (not shown) configured to remove silence frames from the de-duplicated syllable label sequence to obtain a valid syllable label sequence; and a comparison subunit (not shown in the figure) configured to compare the valid syllable label sequence with the connection timing classification peak sequence to obtain the connection timing classification peak accuracy of each second syllable label in the second syllable label sequence.
In some optional implementations of this embodiment, the comparing subunit is further configured to: for each valid syllable label in the valid syllable label sequence, search the connection timing classification peak corresponding to the position of the valid syllable label from the connection timing classification peak sequence; count the number of searched connection timing classification peaks that are the same as the corresponding valid syllable label; and calculate the ratio of the counted number to the total number of the valid syllable labels to obtain the connection timing classification peak accuracy of the valid syllable label.
In some optional implementations of this embodiment, the determining unit 504 is further configured to: and determining the effective syllable label of which the correct rate of the connection time sequence classification peak in the effective syllable label sequence is less than a preset threshold value as a difference syllable label.
In some optional implementations of this embodiment, the first language identification model is trained by: extracting the filter bank coefficient characteristics of the first language audio; and training the deep neural network based on the filter bank coefficient characteristics and the first syllable label sequence to obtain a first language identification model, wherein the node number of an output layer of the first language identification model is equal to the label type number of the first syllable label sequence.
In some optional implementations of the present embodiment, the training unit 505 is further configured to: and performing mixed training on the deep neural network based on the filter bank coefficient characteristics, the first syllable label sequence and the difference syllable labels to obtain a mixed language identification model, wherein the node number of an output layer of the mixed language identification model is equal to the sum of the label type number of the first syllable label sequence and the label type number of the difference syllable labels.
In some optional implementations of the embodiment, the training of the first language recognition model or the hybrid language recognition model optimizes the deep neural network using a training criterion based on connection timing classification.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use in implementing an electronic device (e.g., server 103 shown in FIG. 1) of an embodiment of the present application is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or electronic device. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a generating unit, an input unit, a calculation unit, a determining unit and a training unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the generating unit may also be described as a "unit that generates a first syllable label sequence of first language audio and a second syllable label sequence of second language audio".
As another aspect, the present application also provides a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: generate a first syllable label sequence of first language audio and a second syllable label sequence of second language audio; input the second language audio into a pre-trained first language recognition model to obtain a connectionist temporal classification (CTC) spike sequence, wherein the first language recognition model is trained based on the first syllable label sequence; calculate a CTC spike accuracy for each second syllable label in the second syllable label sequence based on the second syllable label sequence and the CTC spike sequence; determine difference syllable labels from the second syllable label sequence based on the calculated CTC spike accuracies; and perform hybrid training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model.
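By way of illustration only, the following Python sketch outlines the flow just described; it is not the patent's implementation. Every name in it (generate_syllable_labels, ctc_spike_sequence, ACCURACY_THRESHOLD, and so on) is hypothetical, and the model-related steps are left as stubs.

# Hypothetical skeleton of the training flow; names are illustrative and the
# model-related steps are stubs, not the patent's implementation.

ACCURACY_THRESHOLD = 0.5  # assumed value for the "preset threshold" in the claims

def generate_syllable_labels(first_language_audio, second_language_audio):
    """Step 1 (stub): GMM-based alignment yields the first and second syllable label sequences."""
    ...

def ctc_spike_sequence(first_language_model, second_language_audio):
    """Step 2 (stub): run the pre-trained first-language CTC model over the second language audio and keep only its spike outputs."""
    ...

def spike_accuracy(second_labels, spikes):
    """Step 3 (stub): per-label CTC spike accuracy; a concrete version is sketched after the claims."""
    ...

def difference_labels(accuracy_by_label, threshold=ACCURACY_THRESHOLD):
    """Step 4: keep the second-language labels whose spike accuracy falls below the threshold."""
    return {label for label, acc in accuracy_by_label.items() if acc < threshold}

def train_hybrid_model(first_labels, diff_labels, features):
    """Step 5 (stub): retrain the deep neural network on the union of the first-language label set and the difference labels; its output layer has len(first_labels) + len(diff_labels) nodes."""
    ...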
The above description is only a preferred embodiment of the present application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with, but are not limited to, features having similar functions disclosed in the present application.
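Purely as a sketch of the kind of acoustic front end recited in the claims below (Mel-frequency cepstral coefficient features for the alignment model, filter bank features for the neural network), the following Python uses librosa and scikit-learn; the sampling rate, feature dimensions and the use of a plain GaussianMixture as a per-frame stand-in for the aligned Gaussian mixture model are all assumptions, not details taken from this application.

# Illustrative front end only; parameters are assumed, and the per-frame
# GaussianMixture is a toy stand-in for a real GMM-HMM forced aligner.
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, sr=16000, n_mfcc=13):
    """MFCC features (frames x coefficients), the kind used for the alignment model."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def fbank_features(wav_path, sr=16000, n_mels=40):
    """Log Mel filter bank features (frames x bands), the kind used for the deep neural network."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).T

def toy_frame_labels(features, n_states=8, seed=0):
    """Toy per-frame labelling: cluster frames with a Gaussian mixture model and use the component index as a pseudo syllable label (a stand-in for forced alignment)."""
    gmm = GaussianMixture(n_components=n_states, random_state=seed).fit(features)
    return gmm.predict(features)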

Claims (20)

1. A method for training a hybrid language recognition model, comprising:
generating a first syllable label sequence of first language audio and a second syllable label sequence of second language audio;
inputting the second language audio into a pre-trained first language recognition model to obtain a connectionist temporal classification (CTC) spike sequence, wherein the first language recognition model is trained based on the first syllable label sequence;
calculating a CTC spike accuracy for each second syllable label in the second syllable label sequence based on the second syllable label sequence and the CTC spike sequence;
determining difference syllable labels from the second syllable label sequence based on the calculated CTC spike accuracies;
and performing hybrid training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model.
2. The method of claim 1, wherein the generating a first syllable label sequence of first language audio comprises:
extracting Mel-frequency cepstral coefficient (MFCC) features of the first language audio;
and training a Gaussian mixture model based on the MFCC features and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and the first syllable label sequence.
3. The method of claim 2, wherein the generating a second syllable label sequence of second language audio comprises:
and inputting the second language audio into the aligned Gaussian mixture model to obtain the second syllable label sequence, wherein the number of label types of the second syllable label sequence is equal to that of the first syllable label sequence.
4. The method of claim 1, wherein the calculating a CTC spike accuracy for each second syllable label in the second syllable label sequence based on the second syllable label sequence and the CTC spike sequence comprises:
deduplicating the second syllable label sequence to obtain a deduplicated syllable label sequence;
removing silence frames from the deduplicated syllable label sequence to obtain a valid syllable label sequence;
and comparing the valid syllable label sequence with the CTC spike sequence to obtain a CTC spike accuracy for each valid syllable label in the valid syllable label sequence.
5. The method of claim 4, wherein the comparing the valid syllable label sequence with the CTC spike sequence to obtain a CTC spike accuracy for each valid syllable label in the valid syllable label sequence comprises:
for each valid syllable label in the valid syllable label sequence, searching the CTC spike sequence for the CTC spike corresponding to the position of the valid syllable label;
counting the number of the found CTC spikes that are identical to the valid syllable label;
and calculating the ratio of the counted number to the total number of occurrences of the valid syllable label to obtain the CTC spike accuracy of the valid syllable label.
6. The method of claim 4 or 5, wherein the determining difference syllable labels from the second syllable label sequence based on the calculated CTC spike accuracies comprises:
and determining a valid syllable label in the valid syllable label sequence whose CTC spike accuracy is less than a preset threshold as a difference syllable label.
7. The method of claim 1, wherein the first language recognition model is trained by:
extracting filter bank features of the first language audio;
and training a deep neural network based on the filter bank features and the first syllable label sequence to obtain the first language recognition model, wherein the number of nodes in the output layer of the first language recognition model is equal to the number of label types in the first syllable label sequence.
8. The method of claim 7, wherein the performing hybrid training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model comprises:
and performing hybrid training on the deep neural network based on the filter bank features, the first syllable label sequence and the difference syllable labels to obtain the hybrid language recognition model, wherein the number of nodes in the output layer of the hybrid language recognition model is equal to the sum of the number of label types in the first syllable label sequence and the number of difference syllable label types.
9. The method of claim 7 or 8, wherein training the first language recognition model or the hybrid language recognition model optimizes the deep neural network using a CTC-based training criterion.
10. An apparatus for training a hybrid language recognition model, comprising:
a generating unit configured to generate a first syllable label sequence of first language audio and a second syllable label sequence of second language audio;
an input unit configured to input the second language audio into a pre-trained first language recognition model to obtain a connectionist temporal classification (CTC) spike sequence, wherein the first language recognition model is trained based on the first syllable label sequence;
a calculation unit configured to calculate a CTC spike accuracy for each second syllable label in the second syllable label sequence based on the second syllable label sequence and the CTC spike sequence;
a determining unit configured to determine difference syllable labels from the second syllable label sequence based on the calculated CTC spike accuracies;
a training unit configured to perform hybrid training on a deep neural network based on the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model.
11. The apparatus of claim 10, wherein the generating unit is further configured to:
extracting Mel-frequency cepstral coefficient (MFCC) features of the first language audio;
and training a Gaussian mixture model based on the MFCC features and the text corresponding to the first language audio to obtain an aligned Gaussian mixture model and the first syllable label sequence.
12. The apparatus of claim 11, wherein the generating unit is further configured to:
and inputting the second language audio into the aligned Gaussian mixture model to obtain the second syllable label sequence, wherein the number of label types of the second syllable label sequence is equal to that of the first syllable label sequence.
13. The apparatus of claim 10, wherein the calculation unit comprises:
a deduplication subunit configured to deduplicate the second syllable label sequence to obtain a deduplicated syllable label sequence;
a silence removal subunit configured to remove silence frames from the deduplicated syllable label sequence to obtain a valid syllable label sequence;
and a comparison subunit configured to compare the valid syllable label sequence with the CTC spike sequence to obtain a CTC spike accuracy for each valid syllable label in the valid syllable label sequence.
14. The apparatus of claim 13, wherein the comparison subunit is further configured to:
for each valid syllable label in the valid syllable label sequence, searching the CTC spike sequence for the CTC spike corresponding to the position of the valid syllable label;
counting the number of the found CTC spikes that are identical to the valid syllable label;
and calculating the ratio of the counted number to the total number of occurrences of the valid syllable label to obtain the CTC spike accuracy of the valid syllable label.
15. The apparatus of claim 13 or 14, wherein the determining unit is further configured to:
and determining a valid syllable label in the valid syllable label sequence whose CTC spike accuracy is less than a preset threshold as a difference syllable label.
16. The apparatus of claim 10, wherein the first language recognition model is trained by:
extracting filter bank features of the first language audio;
and training a deep neural network based on the filter bank features and the first syllable label sequence to obtain the first language recognition model, wherein the number of nodes in the output layer of the first language recognition model is equal to the number of label types in the first syllable label sequence.
17. The apparatus of claim 16, wherein the training unit is further configured to:
and performing hybrid training on the deep neural network based on the filter bank features, the first syllable label sequence and the difference syllable labels to obtain a hybrid language recognition model, wherein the number of nodes in the output layer of the hybrid language recognition model is equal to the sum of the number of label types in the first syllable label sequence and the number of difference syllable label types.
18. The apparatus of claim 16 or 17, wherein training the first language recognition model or the hybrid language recognition model optimizes the deep neural network using a CTC-based training criterion.
19. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
20. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-9.
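For readers less familiar with the procedure recited in claims 4 to 6, the following Python sketch gives one plausible reading of the deduplication, silence-frame removal, position-wise spike comparison and threshold steps. The silence symbol "sil", the toy labels and the 0.5 threshold are assumptions for illustration, not values taken from this application.

# Hypothetical illustration of claims 4-6; names and values are assumptions.
from collections import defaultdict
from itertools import groupby

SILENCE = "sil"  # assumed silence label

def valid_label_sequence(frame_labels):
    """Claim 4: collapse consecutive repeats (deduplicate), then drop silence frames."""
    deduplicated = [label for label, _ in groupby(frame_labels)]
    return [label for label in deduplicated if label != SILENCE]

def spike_accuracy(valid_labels, spikes):
    """Claim 5: for each valid label, look up the spike at the corresponding position, count matches, and divide by that label's total occurrences."""
    hits, totals = defaultdict(int), defaultdict(int)
    for position, label in enumerate(valid_labels):
        totals[label] += 1
        if position < len(spikes) and spikes[position] == label:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}

def difference_labels(accuracy_by_label, threshold=0.5):
    """Claim 6: valid labels whose spike accuracy is below the preset threshold."""
    return {label for label, acc in accuracy_by_label.items() if acc < threshold}

# Toy usage with made-up frame labels and spikes:
frames = ["sil", "ni3", "ni3", "hao3", "sil", "ma5"]
spikes = ["ni3", "hao2", "ma5"]
valid = valid_label_sequence(frames)                     # ['ni3', 'hao3', 'ma5']
print(difference_labels(spike_accuracy(valid, spikes)))  # {'hao3'}

In the actual method the spike sequence would come from the pre-trained first language recognition model, and the selected difference labels would be appended to the first language label set before the hybrid retraining of claim 8.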
CN201911075088.2A 2019-11-06 2019-11-06 Method and apparatus for training hybrid language recognition models Active CN110675865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911075088.2A CN110675865B (en) 2019-11-06 2019-11-06 Method and apparatus for training hybrid language recognition models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911075088.2A CN110675865B (en) 2019-11-06 2019-11-06 Method and apparatus for training hybrid language recognition models

Publications (2)

Publication Number Publication Date
CN110675865A true CN110675865A (en) 2020-01-10
CN110675865B CN110675865B (en) 2021-09-28

Family

ID=69086213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911075088.2A Active CN110675865B (en) 2019-11-06 2019-11-06 Method and apparatus for training hybrid language recognition models

Country Status (1)

Country Link
CN (1) CN110675865B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101515456A * 2008-02-18 2009-08-26 三星电子株式会社 Speech recognition interface unit and speech recognition method thereof
US20100179811A1 (en) * 2009-01-13 2010-07-15 Crim Identifying keyword occurrences in audio data
US8583432B1 (en) * 2012-07-18 2013-11-12 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN103839545A (en) * 2012-11-23 2014-06-04 三星电子株式会社 Apparatus and method for constructing multilingual acoustic model
US20150340034A1 (en) * 2014-05-22 2015-11-26 Google Inc. Recognizing speech using neural networks
CN107195296A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of audio recognition method, device, terminal and system
US20180322866A1 (en) * 2017-05-04 2018-11-08 Baidu Online Network Technology (Beijing) Co., Ltd Method and device for recognizing speech based on chinese-english mixed dictionary
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition
TWI659411B (en) * 2018-03-01 2019-05-11 大陸商芋頭科技(杭州)有限公司 Multilingual mixed speech recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AYUSHI PANDEY: "Adapting monolingual resources for code-mixed Hindi-English speech recognition", 2017 International Conference on Asian Language Processing (IALP) *
JIA JIA ET AL.: "Research on the conversion between Mandarin Chinese and Shenyang dialect", Journal of Tsinghua University (Science and Technology) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816160A * 2020-07-28 2020-10-23 苏州思必驰信息科技有限公司 Mandarin and Cantonese mixed speech recognition model training method and system
CN113808573A (en) * 2021-08-06 2021-12-17 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113808573B (en) * 2021-08-06 2023-11-07 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention

Also Published As

Publication number Publication date
CN110675865B (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN110046254B (en) Method and apparatus for generating a model
CN107437417B (en) Voice data enhancement method and device based on recurrent neural network voice recognition
CN109241286B (en) Method and device for generating text
CN111274368B (en) Groove filling method and device
CN111428010B (en) Man-machine intelligent question-answering method and device
CN109582825B (en) Method and apparatus for generating information
CN109740167B (en) Method and apparatus for generating information
CN108121699B (en) Method and apparatus for outputting information
EP3940693A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN109920431B (en) Method and apparatus for outputting information
CN108228567B (en) Method and device for extracting short names of organizations
CN109190123B (en) Method and apparatus for outputting information
CN112259089A (en) Voice recognition method and device
CN108877779B (en) Method and device for detecting voice tail point
CN112509562A (en) Method, apparatus, electronic device and medium for text post-processing
WO2020052061A1 (en) Method and device for processing information
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN111078849A (en) Method and apparatus for outputting information
CN111508478A (en) Speech recognition method and device
US9747891B1 (en) Name pronunciation recommendation
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
US11482211B2 (en) Method and apparatus for outputting analysis abnormality information in spoken language understanding
CN110808035B (en) Method and apparatus for training hybrid language recognition models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant