CN112233653A - Method, device and equipment for training multi-dialect accent Mandarin speech recognition model

Info

Publication number
CN112233653A
Authority
CN
China
Prior art keywords
model
dialect
acoustic model
training
data
Legal status
Granted
Application number
CN202011433866.3A
Other languages
Chinese (zh)
Other versions
CN112233653B (en)
Inventor
胡广宇
Current Assignee
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Application filed by Beijing Yuanjian Information Technology Co Ltd
Priority to CN202011433866.3A
Publication of CN112233653A
Application granted
Publication of CN112233653B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method, a device and equipment for training a multi-dialect accent Mandarin speech recognition model, and relates to the technical field of speech recognition. The method comprises the following steps: obtaining training samples; training with labeled standard Mandarin speech data to obtain an initial acoustic model, and training with text data to obtain an initial language model; iteratively training the initial acoustic model based on unlabeled dialect accent Mandarin speech data to obtain a target acoustic model; training a temporary language model on the text to be trained, which is obtained by recognition with the target acoustic model and the initial language model, and merging the temporary language model with the initial language model to obtain a target language model; and combining the target acoustic model and the target language model into a multi-dialect accent Mandarin speech recognition model. A large amount of unlabeled dialect accent Mandarin speech data is used for iterative training to obtain the dialect accent Mandarin speech recognition model, improving the accuracy of dialect accent Mandarin speech recognition.

Description

Method, device and equipment for training multi-dialect accent Mandarin speech recognition model
Technical Field
The invention relates to the technical field of speech recognition, in particular to a method, a device and equipment for training a multi-dialect accent Mandarin speech recognition model.
Background
With the improvement of the Internet and the performance of mobile terminals, intelligent products based on speech recognition technology, such as voice dialogue robots, voice assistants and interactive tools, are increasingly favored in industrial production and daily life. However, in China, people in many regions, including multi-dialect regions, speak Mandarin with a noticeable accent, which mismatches a standard Mandarin model and results in low speech recognition accuracy.
At present, dialect accent Mandarin can be recognized through multi-model approaches based on temporal neural networks: separate pronunciation dictionaries and training data are compiled for different regions, and the multiple models are then either trained directly or obtained by fine-tuning a Mandarin speech recognition model.
However, the currently adopted multi-model recognition methods consume considerable time and manpower on data collection and sample labeling, which increases the difficulty of optimization.
Disclosure of Invention
The present invention aims to provide a method, an apparatus and a device for training a multi-dialect accent Mandarin speech recognition model that overcome the above shortcomings of the prior art, making full use of unlabeled speech data to strengthen model training and avoiding the low final recognition accuracy caused by the limited amount of labeled training sample data in practical applications.
In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:
in a first aspect, an embodiment of the present application provides a multi-dialect accent Mandarin speech recognition model training method, including:
obtaining training samples, the training samples comprising: labeled standard Mandarin speech data, unlabeled dialect accent Mandarin speech data, and text data;
training with the labeled standard Mandarin speech data to obtain an initial acoustic model, and training with the text data to obtain an initial language model;
iteratively training the initial acoustic model based on the unlabeled dialect accent Mandarin speech data to obtain a target acoustic model, wherein the target acoustic model is used for recognizing preset types of dialect accent Mandarin speech data, each type corresponding to one type of dialect accent Mandarin;
training a temporary language model on the text to be trained, which is obtained by recognition with the target acoustic model and the initial language model, and merging the temporary language model with the initial language model to obtain a target language model;
and combining the target acoustic model and the target language model into a multi-dialect accent Mandarin speech recognition model.
Optionally, the iteratively training the initial acoustic model based on the unlabeled dialect accent Mandarin speech data to obtain a target acoustic model includes:
taking the initial acoustic model as an initial temporary acoustic model;
A. performing recognition processing on the unlabeled dialect accent Mandarin speech data by using the temporary acoustic model and the initial language model to obtain recognition texts;
B. obtaining a preset number of data sets according to the confidence degrees, the language information and the region label information of the recognition texts, wherein each data set comprises a plurality of recognition texts, and the recognition texts in the same data set correspond to dialect accent Mandarin of the same type;
C. adjusting the temporary acoustic model by using each data set respectively to obtain the preset number of dialect acoustic models;
D. screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model;
E. merging each candidate merging model with the temporary acoustic model to obtain a new temporary acoustic model;
and circularly executing the steps A-E until the accuracy of the temporary acoustic model meets a preset condition, and taking the temporary acoustic model as the target acoustic model.
Optionally, before performing recognition processing on the unlabeled dialect accent Mandarin speech data by using the temporary acoustic model and the initial language model, the method further includes:
and determining language information of the unlabeled dialect accent Mandarin speech data by using a preset language recognition engine.
Optionally, the obtaining a preset number of data sets according to the confidence degrees, the language information, and the region label information of the recognition texts includes:
screening out, from the recognition texts, the recognition texts with confidence degrees larger than a preset threshold;
and dividing the recognition texts with confidence degrees larger than the preset threshold into the preset number of data sets according to the language information and the region label information.
Optionally, the screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model includes:
performing recognition processing on the labeled speech test set with each dialect acoustic model respectively to obtain recognition results;
determining the accuracy of each dialect acoustic model according to the recognition results and the labeling information of the labeled speech test set;
and screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model.
Optionally, before training with the text to be trained, which is obtained by recognition with the target acoustic model and the initial language model, to obtain the target language model, the method further includes:
and recognizing the labeled standard Mandarin speech data and the unlabeled dialect accent Mandarin speech data by using the target acoustic model and the initial language model to obtain the text to be trained.
Optionally, the merging the temporary language model and the initial language model to obtain a target language model includes:
determining the perplexity of the labeled speech test set by using the temporary language model and the initial language model respectively;
and carrying out interpolation processing on the temporary language model and the initial language model according to the perplexity to obtain the target language model.
In a second aspect, an embodiment of the present application further provides a multi-dialect accent Mandarin speech recognition model training device, where the device includes: an acquisition module, a training module and a combination module;
the acquisition module is configured to obtain training samples, where the training samples include: labeled standard Mandarin speech data, unlabeled dialect accent Mandarin speech data, and text data;
the training module is used for training with the labeled standard Mandarin speech data to obtain an initial acoustic model and training with the text data to obtain an initial language model; iteratively training the initial acoustic model based on the unlabeled dialect accent Mandarin speech data to obtain a target acoustic model, wherein the target acoustic model is used for recognizing preset types of dialect accent Mandarin speech data, each type corresponding to one type of dialect accent Mandarin; and training with the text to be trained, which is obtained by recognition with the target acoustic model and the initial language model, to obtain a target language model;
the combination module is used for combining the target acoustic model and the target language model into a multi-dialect accent Mandarin Chinese speech recognition model.
Optionally, the training module is further configured to:
taking the initial acoustic model as an initial temporary acoustic model;
A. performing recognition processing on the unlabeled dialect accent Mandarin speech data by using the temporary acoustic model and the initial language model to obtain recognition texts;
B. obtaining a preset number of data sets according to the confidence degrees, the language information and the region label information of the recognition texts, wherein each data set comprises a plurality of recognition texts, and the recognition texts in the same data set correspond to dialect accent Mandarin of the same type;
C. adjusting the temporary acoustic model by using each data set respectively to obtain the preset number of dialect acoustic models;
D. screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model;
E. merging each candidate merging model with the temporary acoustic model to obtain a new temporary acoustic model;
and circularly executing the steps A-E until the accuracy of the temporary acoustic model meets a preset condition, and taking the temporary acoustic model as the target acoustic model.
Optionally, the training module is further configured to:
and determining language information of the unlabeled dialect accent Mandarin speech data by using a preset language recognition engine.
Optionally, the training module is further configured to:
screening out, from the recognition texts, the recognition texts with confidence degrees larger than a preset threshold;
and dividing the recognition texts with confidence degrees larger than the preset threshold into the preset number of data sets according to the language information and the region label information.
Optionally, the training module is further configured to:
performing recognition processing on the labeled speech test set with each dialect acoustic model respectively to obtain recognition results;
determining the accuracy of each dialect acoustic model according to the recognition results and the labeling information of the labeled speech test set;
and screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model.
Optionally, the training module is further configured to recognize the labeled standard Mandarin speech data and the unlabeled dialect accent Mandarin speech data by using the target acoustic model and the initial language model to obtain the text to be trained.
Optionally, the training module is further configured to:
determining the perplexity of the labeled speech test set by using the temporary language model and the initial language model respectively;
and carrying out interpolation processing on the temporary language model and the initial language model according to the perplexity to obtain the target language model.
In a third aspect, an embodiment of the present application further provides a multi-dialect accent Mandarin speech recognition model training device, including: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor; when the multi-dialect accent Mandarin speech recognition model training device is operating, the processor and the storage medium communicate via the bus, and the processor executes the machine-readable instructions to perform the steps of the method provided in the first aspect.
In a fourth aspect, the present application further provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the method as provided in the first aspect.
The beneficial effects of this application are as follows:
the embodiment of the application provides a method, a device and equipment for training a multi-language accent Mandarin speech recognition model, which comprises the following steps: obtaining training samples, the training samples comprising: standard common speech sound data with labels, non-labeled dialect accent common speech sound data and text data; training standard Mandarin voice data with labels to obtain an initial acoustic model, and training text data to obtain an initial language model; iteratively training an initial acoustic model based on unlabeled dialect accent Mandarin voice data to obtain a target acoustic model, wherein the target acoustic model is used for identifying preset types of dialect accent Mandarin voice data, and each type of dialect accent Mandarin voice data corresponds to one type of dialect accent Mandarin; training a text to be trained, which is obtained by identifying a target acoustic model and an initial language model, to obtain a temporary language model, and combining the temporary language model and the initial language model to obtain a target language model; the target acoustic model and the target language model are combined into a multi-dialect accent Mandarin speech recognition model. In the scheme, a large amount of un-labeled dialect accent common speech data is fully utilized to strengthen the iterative training of the target acoustic model and the target language model, and the condition that the final recognition precision is not high due to the lack of the limitation of labeled training sample data in practical application is avoided; in addition, the obtained multi-aspect accent Mandarin speech recognition model not only improves the accuracy of the accent Mandarin speech recognition, but also improves the recognition rate of the standard Mandarin.
In addition, the dialect accent Mandarin models are finally merged into one model, which avoids the complicated process of requiring multiple speech recognition models when recognizing dialect accent Mandarin and improves the efficiency of recognizing dialect accent Mandarin speech data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic structural diagram of a multi-dialect accent Mandarin speech recognition model training apparatus according to an embodiment of the present application;
FIG. 2 is a schematic flowchart illustrating a method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating another method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating another method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a multi-dialect accent Mandarin speech recognition model training device according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
FIG. 1 is a schematic structural diagram of a multi-dialect accent Mandarin speech recognition model training apparatus according to an embodiment of the present application. The multi-dialect accent Mandarin speech recognition model training apparatus 100 may be a processing device such as a computer or a server, used to implement the multi-dialect accent Mandarin speech recognition model training method of the present application. As shown in fig. 1, the multi-dialect accent Mandarin speech recognition model training apparatus 100 includes: a processor 101 and a memory 102.
The processor 101 and the memory 102 are electrically connected directly or indirectly to realize data transmission or interaction. For example, electrical connections may be made through one or more communication buses or signal lines.
The processor 101 may be an integrated circuit chip having signal processing capability. The processor 101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like, and may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The Memory 102 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
It is to be understood that the configuration depicted in FIG. 1 is merely illustrative, and the multi-dialect accent Mandarin speech recognition model training apparatus 100 may include more or fewer components than shown in FIG. 1 or have a different configuration. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
The memory 102 is used for storing a program, and the processor 101 calls the program stored in the memory 102 to execute the multi-dialect accent Mandarin speech recognition model training method provided by the following embodiments.
Fig. 2 is a schematic flowchart of a method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application. Optionally, the execution subject of the method may be a device with data processing functions, such as a server or a computer. As shown in fig. 2, the method includes:
s201, obtaining a training sample, wherein the training sample comprises: standard normal speech sound data with labels, non-labeled dialect accent normal speech sound data, and text data.
In specific implementations, the training samples can be obtained from open-source databases, or by intercepting audio data and manually labeling the intercepted audio data.
It can be understood that the labeled training samples are important factors for improving the recognition performance of the training model. The higher the marking precision of the training sample is, the higher the recognition precision of the model obtained by training is; the lower the labeling precision is, the lower the recognition precision of the trained model is. However, a lot of time and labor are needed for labeling the samples, and the precision of training sample labeling can be specifically set according to the actual identification requirement of the model, which is not limited herein.
S202, training with the labeled standard Mandarin speech data to obtain an initial acoustic model, and training with the text data to obtain an initial language model.
In an implementation manner, a Time-Delay Neural Network (TDNN) model may be selected as the training model; the labeled standard Mandarin speech data is input into the TDNN model, and the initial acoustic model is obtained through training.
Alternatively, considering that context phonemes influence the pronunciation of the current central phoneme and produce coarticulation effects, a triphone-based initial acoustic model can be trained so that the initial acoustic model describes speech more accurately.
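A minimal sketch of such an acoustic model is given below for illustration only; the patent does not specify the network details, so the feature dimension, layer widths, dilations and number of senone targets are assumptions:

```python
# Minimal TDNN acoustic-model sketch (PyTorch). Feature dimension, layer
# widths, dilations and number of senone outputs are illustrative
# assumptions, not values taken from this application.
import torch
import torch.nn as nn

class TDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, num_senones=3000):
        super().__init__()
        # each Conv1d layer spans a widening temporal context via dilation,
        # which is the defining property of a time-delay neural network
        self.layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.out = nn.Conv1d(hidden, num_senones, kernel_size=1)

    def forward(self, x):                 # x: (batch, feat_dim, frames)
        return self.out(self.layers(x))   # per-frame senone logits

model = TDNN()
feats = torch.randn(8, 40, 200)           # 8 utterances of 200 feature frames
logits = model(feats)                      # (8, 3000, 186) after context shrink
```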
In one implementation, an initial language model based on 4-grams is trained using the text data, where a 4-gram model means that the probability of a word occurring is related only to the 3 words that precede it.
In particular, according to the definition of an n-gram model, the probability of a word sequence $w_1, w_2, \ldots, w_m$ occurring is:

$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \qquad (1)$$

The conditional probability of the 4-gram can then be obtained, which is simplified as:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \qquad (2)$$

The probability of a sentence (word sequence) can then be estimated according to the initial language model obtained by training.
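To make the 4-gram estimation concrete, the following is a minimal count-based sketch; the whitespace tokenization and add-one smoothing are illustrative assumptions, as the application does not specify them:

```python
# Minimal 4-gram language-model sketch (formulas (1)-(2)) with add-one
# smoothing. Tokenization and smoothing are illustrative assumptions.
from collections import Counter

def train_4gram(sentences):
    ctx_counts, gram_counts, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>"] * 3 + s.split() + ["</s>"]
        vocab.update(words)
        for i in range(3, len(words)):
            ctx, w = tuple(words[i - 3:i]), words[i]
            ctx_counts[ctx] += 1           # count of the 3-word history
            gram_counts[ctx + (w,)] += 1   # count of the full 4-gram
    return ctx_counts, gram_counts, vocab

def prob(ctx_counts, gram_counts, vocab, ctx, w):
    # P(w | ctx) per formula (2), with add-one smoothing over the vocabulary
    return (gram_counts[tuple(ctx) + (w,)] + 1) / (ctx_counts[tuple(ctx)] + len(vocab))

ctx_c, gram_c, voc = train_4gram(["i like mandarin speech", "i like speech recognition"])
print(prob(ctx_c, gram_c, voc, ("<s>", "<s>", "i"), "like"))
```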
S203, iteratively training the initial acoustic model based on the unlabeled dialect accent Mandarin voice data to obtain a target acoustic model.
The target acoustic model is used for recognizing preset types of dialect accent Mandarin voice data, and each type of dialect accent Mandarin voice data corresponds to one type of dialect accent Mandarin.
For example, the unlabeled dialect accent Mandarin speech data may include preset types of dialect accent Mandarin speech data, such as speech data of the seven major dialect accents.
It should be noted that the output of the target acoustic model obtained above may be probability distribution information of the speech data rather than the final speech recognition result; the target acoustic model needs to be used in combination with the target language model obtained below to realize speech recognition of dialect accent Mandarin.
In an implementation manner, the initial acoustic model and the initial language model are used to decode the unlabeled dialect accent Mandarin speech data to obtain corresponding label texts for the seven types of dialect accent Mandarin.
The initial acoustic model is then iterated based on the dialect accent Mandarin speech data and the corresponding label texts to obtain the target acoustic model.
S204, training a temporary language model on the text to be trained, which is obtained by recognition with the target acoustic model and the initial language model, and merging the temporary language model with the initial language model to obtain the target language model.
In one implementation, the unlabeled dialect accent Mandarin speech data is decoded using the target acoustic model and the initial language model to obtain labeled dialect accent Mandarin text data, which can be used as the text to be trained to train a 4-gram-based dialect accent Mandarin language model, also referred to as the temporary language model.
Then, the obtained temporary language model is merged with the initial language model to obtain the target language model.
Meanwhile, the target acoustic model and the target language model are automatically updated according to the service scene.
S205, combining the target acoustic model and the target language model into a multi-dialect accent Mandarin speech recognition model.
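For reference, the combined model performs recognition according to the standard decision rule of statistical speech recognition (a textbook formulation, not an equation stated in this application): for an input feature sequence $X$, the recognized word sequence is

$$W^{*} = \arg\max_{W} P(X \mid W)\, P(W)$$

where $P(X \mid W)$ is scored by the target acoustic model and $P(W)$ by the target language model.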
For example, a sentence of northern dialect accent Mandarin speech audio is input into the multi-dialect accent Mandarin speech recognition model for recognition, and the corresponding text data of the sentence can be obtained.
Experiments demonstrate that the multi-dialect accent Mandarin speech recognition model provided by the application has high recognition performance and strong robustness. Using the model, the accuracy of dialect accent Mandarin recognition in each region is improved by 3-5 percentage points, while the recognition rate of standard Mandarin is also improved to a certain extent, effectively solving the problem that dialect accent Mandarin degrades the performance of a standard Mandarin model.
In summary, the embodiment of the present application provides a method for training a multi-dialect accent Mandarin speech recognition model, including: obtaining training samples comprising labeled standard Mandarin speech data, unlabeled dialect accent Mandarin speech data and text data; training with the labeled standard Mandarin speech data to obtain an initial acoustic model, and training with the text data to obtain an initial language model; iteratively training the initial acoustic model based on the unlabeled dialect accent Mandarin speech data to obtain a target acoustic model, wherein the target acoustic model is used for recognizing preset types of dialect accent Mandarin speech data, each type corresponding to one type of dialect accent Mandarin; training with the text to be trained, which is obtained by recognition with the target acoustic model and the initial language model, to obtain a target language model; and combining the target acoustic model and the target language model into a multi-dialect accent Mandarin speech recognition model. In this scheme, a large amount of unlabeled dialect accent Mandarin speech data is fully utilized to strengthen the iterative training of the target acoustic model and the target language model, avoiding the low final recognition accuracy caused by the limited amount of labeled training sample data in practical applications. In addition, the obtained multi-dialect accent Mandarin speech recognition model not only improves the accuracy of dialect accent Mandarin speech recognition but also improves the recognition rate of standard Mandarin.
FIG. 3 is a schematic flow chart illustrating another method for training a dialect accent Mandarin speech recognition model according to an embodiment of the present application; as shown in fig. 3, the step S203: iteratively training an initial acoustic model based on unlabeled dialect accent Mandarin speech data to obtain a target acoustic model, comprising:
the method is based on the general standard mandarin speech recognition technology, and a set of robust training method is provided for the dialect accent mandarin speech recognition, so that the accuracy of the dialect accent mandarin speech recognition can be improved, and the standard mandarin speech recognition rate is improved to a certain extent.
The initial acoustic model may be used as an initial temporary acoustic model, the temporary acoustic model is continuously updated by executing a loop, and when the temporary acoustic model obtained in a certain loop satisfies a preset condition, the loop is ended, and the temporary acoustic model at this time is used as a target acoustic model.
It is worth noting that in the following loop, the initial language model need not be changed.
The method comprises the following specific steps:
s301, using the temporary acoustic model and the initial language model to recognize the unlabeled dialect accent Mandarin voice data to obtain a recognition text.
Inputting the un-labeled dialect accent common speech data into the temporary acoustic model and the initial language model for identification to obtain an identification text, namely labeled dialect accent common speech text data.
S302, obtaining a preset number of data sets according to confidence degrees, language information and region label information of the recognition texts, wherein each data set comprises a plurality of recognition texts, and the recognition texts in the same data set correspond to dialect accent Mandarin of the same type.
The language information indicates which preset type of dialect accent Mandarin a piece of speech data belongs to, and may cover seven dialect groups, such as the Northern dialect, Wu dialect, Xiang dialect, Gan dialect, Hakka dialect, Min dialect and Yue (Cantonese) dialect.
The region label information means that different dialect accent Mandarin speech data carry corresponding region labels; for example, dialect accent Mandarin speech data A belongs to a Northern dialect area, while dialect accent Mandarin speech data B belongs to a Wu dialect area.
It is to be understood that the target of the recognition processing with the temporary acoustic model and the initial language model is the unlabeled dialect accent Mandarin speech data, which may include preset types of dialect accent Mandarin speech data, such as the seven major dialect accents; the obtained recognition texts accordingly include seven types of labeled dialect accent Mandarin text data.
The confidence of the recognition texts refers to the confidence of the sentences in the seven types of labeled dialect accent Mandarin text data, computed with a confidence algorithm.
For example, a Word Lattice based confidence algorithm is used to calculate the confidence of each sentence in the seven types of labeled dialect accent Mandarin text data included in the recognition texts, and the recognition texts with high confidence are screened out to obtain 7 different dialect accent Mandarin text data sets. Each data set can include a plurality of recognition texts, and the recognition texts in the same data set correspond to the same type of dialect accent Mandarin, so that the duration distribution of the dialect accent Mandarin data of each region is kept consistent, as is the duration distribution of each speaker.
S303, adjusting the temporary acoustic model with each data set respectively to obtain the preset number of dialect acoustic models.
For example, the temporary acoustic model is fine-tuned with the same-type dialect accent Mandarin data corresponding to each of the 7 selected dialect accent Mandarin text data sets, obtaining dialect acoustic models corresponding to 7 different regions: dialect acoustic model 1, dialect acoustic model 2, dialect acoustic model 3, dialect acoustic model 4, dialect acoustic model 5, dialect acoustic model 6 and dialect acoustic model 7.
S304, screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model.
For example, each of the 7 regional dialect acoustic models obtained above decodes the labeled dialect accent Mandarin test set of its corresponding region, and the accuracy (or word error rate) of each dialect acoustic model is calculated.
Whether the accuracy of each dialect acoustic model is greater than the decoding accuracy of the temporary acoustic model is then determined; if so, the dialect acoustic model is taken as a candidate merging model.
For example, if after comparison the accuracy of dialect acoustic models 3, 4 and 5 is found to be greater than the recognition accuracy of the temporary acoustic model, dialect acoustic models 3, 4 and 5 are used as candidate merging models.
It should be noted that the accuracy of dialect acoustic models 1, 2, 6 and 7 is less than the recognition accuracy of the temporary acoustic model, so they cannot be used as candidate merging models.
S305, merging each candidate merging model with the temporary acoustic model to obtain a new temporary acoustic model.
For example, on the basis of the above embodiment, the selected dialect acoustic model 3, dialect acoustic model 4 and dialect acoustic model 5 are merged with the temporary acoustic model to obtain a new temporary acoustic model.
In an implementation manner, the following merging method may be adopted to merge the candidate merging models with the temporary acoustic model, specifically as follows:

$$P(S \mid X) = \alpha_b P(S_b \mid X) + \sum_{k=1}^{7} \alpha_k P(S_k \mid X) \qquad (3)$$

where $X$ represents the input feature sequence; $S_1, S_2, \ldots, S_7$ represent the hidden Markov states of the dialect accent Mandarin models and $S_b$ the hidden Markov states of the temporary acoustic model; $0 < \alpha_1, \alpha_2, \ldots, \alpha_7, \alpha_b < 1$ are weight coefficients, with $\alpha_b$ the temporary acoustic model weight coefficient. Each weight coefficient may be determined based on test results on a specific data set; for example, the weight $\alpha_b$ may be set to 0.8, with the other coefficients sharing the remaining 0.2 on average.
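A minimal sketch of this merging step is given below, treating the merge as a weighted average of model parameters; the 0.8 weight follows the example above, while averaging parameters (rather than, say, combining output posteriors at decoding time) is an assumption about the merging granularity:

```python
# Sketch: merge candidate dialect models into the temporary acoustic model
# by weighted parameter averaging (cf. formula (3)). Parameter-level
# averaging and the no-argument constructor are assumptions.
import torch

def merge_models(temp_model, candidate_models, temp_weight=0.8):
    cand_weight = (1.0 - temp_weight) / len(candidate_models)
    merged = {}
    for name, param in temp_model.state_dict().items():
        if not param.is_floating_point():      # copy integer buffers as-is
            merged[name] = param.clone()
            continue
        merged[name] = temp_weight * param
        for cand in candidate_models:
            merged[name] = merged[name] + cand_weight * cand.state_dict()[name]
    new_model = type(temp_model)()             # same architecture as temp_model
    new_model.load_state_dict(merged)
    return new_model
```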
S306, judging whether the accuracy of the temporary acoustic model meets a preset condition; if so, executing step S307; if not, jumping back to step S301 and circularly executing steps S301-S306.
For example, the preset condition may be: compared with the temporary acoustic model obtained in the previous iteration, the accuracy improvement of the current temporary acoustic model on the labeled dialect accent Mandarin test set is smaller than 0.001%, indicating that further iterations cannot obviously improve the accuracy.
For example, after the 3rd iteration, the labeled dialect accent Mandarin test set is input into the temporary acoustic model and an accuracy of 96.084% is calculated; compared with the 96.079% of the temporary acoustic model obtained in the 2nd iteration, the accuracy after this iteration is improved by 0.005%, which does not meet the preset condition of being smaller than 0.001%, so the process jumps to step S301 and steps S301-S306 are executed in a loop.
For another example, after the 4th iteration, the accuracy of the temporary acoustic model on the labeled dialect accent Mandarin test set is calculated to be 96.0846%; compared with the 96.084% of the temporary acoustic model obtained in the 3rd iteration, the improvement is 0.0006%, which meets the preset condition of being smaller than 0.001%, indicating that the accuracy can no longer be obviously improved, and step S307 is then executed.
And S307, taking the temporary acoustic model as a target acoustic model.
For example, on the basis of the above embodiment, the temporary acoustic model obtained after the 4th iteration may be used as the target acoustic model.
In this embodiment, the 7 selected types of dialect accent Mandarin data are used to respectively fine-tune the temporary acoustic model to obtain dialect acoustic models corresponding to 7 different regions; at least one candidate merging model with high recognition accuracy is selected from the 7 dialect acoustic models and merged with the temporary acoustic model to finally obtain the target acoustic model. This avoids the complicated process of maintaining a plurality of acoustic models when recognizing multi-dialect accent Mandarin, and also improves the efficiency of recognizing multi-dialect accent Mandarin speech data.
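Putting steps S301-S307 together, the outer loop can be sketched as the following control-flow skeleton; the callables passed in stand for the operations described above and are assumed interfaces, not code from this application:

```python
# Control-flow skeleton of the iterative training (steps S301-S307).
# decode, partition, finetune, accuracy and merge_models are assumed
# helpers implementing the operations described in the text.
def train_target_acoustic_model(initial_am, initial_lm, unlabeled_audio,
                                labeled_test_set, decode, partition,
                                finetune, accuracy, merge_models,
                                eps=1e-5):          # 0.001% as a fraction
    temp_am = initial_am
    prev_acc = accuracy(temp_am, labeled_test_set)
    while True:
        texts = decode(temp_am, initial_lm, unlabeled_audio)        # S301
        datasets = partition(texts)  # S302: confidence, language, region
        dialect_ams = [finetune(temp_am, d) for d in datasets]      # S303
        candidates = [m for m in dialect_ams                        # S304
                      if accuracy(m, labeled_test_set) > prev_acc]
        if candidates:                                              # S305
            temp_am = merge_models(temp_am, candidates)
        acc = accuracy(temp_am, labeled_test_set)                   # S306
        if acc - prev_acc < eps:     # improvement below the preset condition
            return temp_am                                          # S307
        prev_acc = acc
```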
Optionally, in step S301, before the unlabeled dialect accent Mandarin speech data is recognized by using the temporary acoustic model and the initial language model, the method further includes:
determining language information of the unlabeled dialect accent Mandarin speech data by using a preset language recognition engine.
The language recognition engine can recognize input speech data and extract language information characterizing the language features of the speech data.
For example, if the unlabeled dialect accent Mandarin speech data includes speech data of different dialect accents, a preset language recognition engine may be used to identify and classify the pieces of dialect accent Mandarin speech data to obtain their language information; for example, it may be determined that one piece of the unlabeled speech data belongs to the Wu dialect.
FIG. 4 is a flowchart illustrating a method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application; as shown in fig. 4, obtaining a preset number of data sets according to the confidence degrees, language information and region label information of the recognition texts includes:
s401, screening out the recognition texts with the confidence degrees larger than a preset threshold value from the recognition texts.
It is to be understood that the recognition texts include a plurality of dialect accent Mandarin text data.
In an implementation, the confidence of the recognition texts needs to be determined first; for example, the confidence of the recognition texts can be calculated by using a Word Lattice based confidence algorithm. For each arc on the lattice, an arc-level posterior probability is calculated, and then the posterior probabilities of all arcs carrying the same word that overlap the arc in time are accumulated as the confidence of the word represented by that arc. The specific accumulation method is:

$$\mathrm{conf}(w, a) = \sum_{a' \in A(w, a)} p(a') \qquad (4)$$

where $A(w, a)$ is the set of arcs labeled with the same word $w$ whose spans $[t_s(a'), t_e(a')]$ overlap the span $[t_s(a), t_e(a)]$ of arc $a$, $t_s$ and $t_e$ denote start and end times, and $p(a')$ is the lattice posterior probability of arc $a'$. After the confidence of each word in a sentence is calculated, the confidences of all the words are averaged as the confidence of the sentence.
After the confidence of each dialect accent Mandarin text in the recognition texts is calculated, whether each confidence is larger than a preset threshold is determined respectively, and the recognition texts with confidence larger than the preset threshold are screened out.
S402, dividing the recognition texts with the confidence degrees larger than a preset threshold value into a preset number of data sets according to the language information and the region label information.
For example, if the analysis determines that the confidence of a Wu dialect accent Mandarin recognition text is greater than the preset threshold, the text may be assigned to the corresponding region in the seven major dialect accent Mandarin data sets, so that the recognition texts in the Wu dialect accent Mandarin text data set correspond to the same class of Wu dialect accent Mandarin.
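A minimal sketch of steps S401-S402 follows, assuming the per-word confidences have already been extracted from the lattice via formula (4); the utterance layout and the 0.9 threshold are illustrative assumptions:

```python
# Sketch: keep high-confidence recognition texts (S401) and bucket them
# by language and region label (S402). Data layout is an assumption.
def partition_by_confidence(utterances, threshold=0.9):
    datasets = {}
    for utt in utterances:  # utt: {"word_confs", "language", "region", "text"}
        sentence_conf = sum(utt["word_confs"]) / len(utt["word_confs"])
        if sentence_conf > threshold:               # S401: confidence filter
            key = (utt["language"], utt["region"])  # S402: dialect bucket
            datasets.setdefault(key, []).append(utt["text"])
    return datasets
```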
FIG. 5 is a flowchart illustrating another method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application; as shown in fig. 5, screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of the dialect acoustic models includes:
S501, performing recognition processing on the labeled speech test set with each dialect acoustic model respectively to obtain recognition results.
The labeled speech test set can be a test set of dialect accent Mandarin data labeled by region.
For example, a test set of labeled Wu dialect accent Mandarin data may be recognized with the Wu dialect acoustic model to obtain a recognition text of the Wu dialect accent Mandarin.
S502, determining the accuracy of each dialect acoustic model according to the recognition results and the labeling information of the labeled speech test set.
The accuracy rate is the ratio of the number of correctly identified samples to the total number of samples in the test set, and can be used for evaluating the accuracy rate of dialect acoustic model identification.
For example, after calculating the accuracy of each dialect acoustic model on the labeled speech test set, the accuracy of the Wu dialect acoustic model is found to be 80%.
S503, screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model.
For example, whether the accuracy of each dialect acoustic model is greater than a preset threshold is judged; if so, the dialect acoustic model is screened out. If the preset threshold is 75%, the 80% accuracy of the Wu dialect acoustic model exceeds the threshold, and the Wu dialect acoustic model is screened out as a candidate merging model.
The accuracy of the other dialect acoustic models can be judged in the same way; those exceeding the preset threshold are screened out as candidate merging models, yielding a plurality of candidate merging models.
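The screening in S501-S503 can be sketched as follows; `recognize` is an assumed decoding helper, and the 75% threshold follows the example above:

```python
# Sketch of steps S501-S503: keep dialect models whose test-set accuracy
# exceeds the preset threshold. recognize() is an assumed helper.
def screen_candidate_models(dialect_models, labeled_test_set, recognize,
                            threshold=0.75):
    candidates = []
    for name, model in dialect_models.items():
        correct = sum(1 for audio, reference in labeled_test_set
                      if recognize(model, audio) == reference)      # S501
        acc = correct / len(labeled_test_set)                       # S502
        if acc > threshold:         # S503: e.g. the Wu model at 80% passes
            candidates.append((name, model, acc))
    return candidates
```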
Optionally, before training with the text to be trained, which is obtained by recognition with the target acoustic model and the initial language model, to obtain the target language model, the method further includes:
recognizing the labeled standard Mandarin speech data and the unlabeled dialect accent Mandarin speech data by using the target acoustic model and the initial language model to obtain the text to be trained.
For example, recognizing the labeled standard Mandarin speech data with the target acoustic model and the initial language model yields a labeled standard Mandarin recognition text.
Furthermore, recognizing the unlabeled dialect accent Mandarin speech data with the target acoustic model and the initial language model yields a labeled dialect accent Mandarin recognition text. The labeled standard Mandarin recognition text and the labeled dialect accent Mandarin recognition text together constitute the text to be trained.
Optionally, because the trained target acoustic model already has high accuracy, the target acoustic model and the initial language model may also be used to recognize the labeled standard Mandarin speech data together with other unlabeled dialect accent Mandarin speech data beyond the unlabeled data described above, and the text obtained from this recognition may be used as the text to be trained; this is not specifically limited in the embodiments of the present application.
FIG. 6 is a flowchart illustrating a method for training a multi-dialect accent Mandarin speech recognition model according to an embodiment of the present application; as shown in fig. 6, merging the temporary language model and the initial language model to obtain the target language model includes:
s601, determining the confusion degree of the marked voice test set by using the temporary language model and the initial language model respectively.
Wherein, the marked voice test set can be a set of dialect accent Mandarin test set marked by each zone and a standard Mandarin test set marked by each zone.
Perplexity (PPL for short) is an index used in the field of natural language processing to measure the quality of a language model; the lower the PPL value, the better the corresponding language model. According to the definition of PPL, the calculation formula is as follows:

$$\mathrm{PPL}(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}} \qquad (5)$$

where $S$ represents the current sentence, $N$ represents the sentence length, and $P(w_i \mid w_1, \ldots, w_{i-1})$ represents the probability of the $i$-th word calculated based on the first $i-1$ words.
For example, the perplexity of the labeled speech test set is calculated using the temporary language model and the initial language model respectively; the PPL value of the temporary language model is 101 and the PPL value of the initial language model is 155.
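A direct implementation of formula (5) can be sketched as follows, assuming the language model exposes a conditional word-probability function:

```python
# Sketch: per-sentence perplexity under a language model (formula (5)).
# lm_prob(word, history) -> P(word | history) is an assumed interface.
import math

def perplexity(sentence_words, lm_prob):
    n = len(sentence_words)
    log_p = sum(math.log(lm_prob(w, sentence_words[:i]))
                for i, w in enumerate(sentence_words))
    return math.exp(-log_p / n)    # equals P(w_1..w_N) ** (-1/N)
```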
S602, performing interpolation processing on the temporary language model and the initial language model according to the perplexity to obtain the target language model.
In practice, the temporary language model and the initial language model are fused by interpolation to obtain the target language model; the interpolation requires setting a weighting coefficient.
Therefore, in the embodiment of the application, an optimal interpolation coefficient can be determined according to the PPL value, so that after the temporary language model and the initial language model are interpolated, the PPL value of the obtained target language model is minimal, ensuring that the target language model has the best recognition effect. The specific implementation method is:

$$P(W) = \lambda P_1(W) + (1 - \lambda) P_2(W) \qquad (6)$$

where $\lambda$ is the weighting coefficient of the initial language model, $0 < \lambda < 1$, and $P(W)$, $P_1(W)$ and $P_2(W)$ respectively represent the probabilities of a word under the 4-gram-based target language model, initial language model and temporary language model.
If the interpolation coefficient λ is close to 1, the initial language model dominates; the closer it is to 0, the more important the temporary language model becomes.
For example, when the first interpolation is performed, λ is 0.5, and the obtained PPL value of the target language model is 108; after multiple times of interpolation, analysis determines that when the lambda is 0.74, the obtained target language model has the minimum PPL value of 102, and the obtained target language model is the optimal model.
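The search for the optimal interpolation coefficient can be sketched as a simple grid search over λ that minimizes test-set perplexity per formula (6); the 0.01 grid step and the `perplexity` helper interface above are assumptions:

```python
# Sketch: grid-search the interpolation weight lambda (formula (6)),
# keeping the value that minimizes average test-set perplexity.
def best_interpolation(p_initial, p_temporary, test_sentences, perplexity):
    best_lam, best_ppl = None, float("inf")
    for step in range(1, 100):                    # lambda in (0, 1)
        lam = step / 100.0
        interp = lambda w, h, lam=lam: (lam * p_initial(w, h)
                                        + (1.0 - lam) * p_temporary(w, h))
        ppl = sum(perplexity(s, interp)
                  for s in test_sentences) / len(test_sentences)
        if ppl < best_ppl:
            best_lam, best_ppl = lam, ppl
    return best_lam, best_ppl                     # e.g. lambda near 0.74
```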
It can be understood that obtaining the optimal target language model enables accurate recognition of multi-dialect accent Mandarin text data and greatly improves its convenience.
Optionally, the target acoustic model and the target language model obtained in the above embodiments may be combined into a multi-dialect accent Mandarin speech recognition model, improving the accuracy and efficiency of multi-dialect accent Mandarin speech recognition.
The following describes a device and a storage medium for executing the method for training a multi-dialect accent Mandarin speech recognition model provided by the present application; their specific implementation processes and technical effects are described above and will not be repeated below.
FIG. 7 is a schematic structural diagram of a multi-dialect accent Mandarin speech recognition model training device according to an embodiment of the present application; as shown in fig. 7, the device includes: an acquisition module 701, a training module 702 and a combination module 703;
an acquisition module 701, configured to obtain training samples, where the training samples include: labeled standard Mandarin speech data, unlabeled dialect accent Mandarin speech data, and text data;
a training module 702, configured to train with the labeled standard Mandarin speech data to obtain an initial acoustic model, and to train with the text data to obtain an initial language model; iteratively train the initial acoustic model based on the unlabeled dialect accent Mandarin speech data to obtain a target acoustic model, wherein the target acoustic model is used for recognizing preset types of dialect accent Mandarin speech data, each type corresponding to one type of dialect accent Mandarin; and train with the text to be trained, which is obtained by recognition with the target acoustic model and the initial language model, to obtain a target language model;
a combination module 703, configured to combine the target acoustic model and the target language model into a multi-dialect accent Mandarin speech recognition model.
Optionally, the training module 702 is further configured to:
taking the initial acoustic model as an initial temporary acoustic model;
A. performing recognition processing on the unlabeled dialect accent Mandarin speech data by using the temporary acoustic model and the initial language model to obtain recognition texts;
B. obtaining a preset number of data sets according to the confidence degrees, the language information and the region label information of the recognition texts, wherein each data set comprises a plurality of recognition texts, and the recognition texts in the same data set correspond to dialect accent Mandarin of the same type;
C. adjusting the temporary acoustic model by using each data set respectively to obtain the preset number of dialect acoustic models;
D. screening out at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model;
E. merging each candidate merging model with the temporary acoustic model to obtain a new temporary acoustic model;
and circularly executing the steps A-E until the accuracy of the temporary acoustic model meets a preset condition, and taking the temporary acoustic model as the target acoustic model.
Optionally, the training module 702 determines the language information of the unlabeled dialect accent Mandarin speech data using a preset language recognition engine.
Optionally, the training module 702 is further configured to screen out, from the recognition texts, the recognition texts with a confidence greater than a preset threshold, and to divide these recognition texts into the preset number of data sets according to the language information and the region label information.
Optionally, the training module 702 is further configured to perform recognition on the labeled speech test set using each dialect acoustic model to obtain recognition results; determine the accuracy of each dialect acoustic model according to the recognition results and the labeling information of the labeled speech test set; and screen at least one candidate merging model from the dialect acoustic models according to their accuracy, as shown in the sketch below.
Optionally, the training module 702 is further configured to perform recognition on the labeled standard Mandarin speech data and the unlabeled dialect accent Mandarin speech data using the target acoustic model and the initial language model, obtaining the text to be trained.
Optionally, the training module 702 is further configured to determine the perplexity of the labeled speech test set using the temporary language model and the initial language model respectively, and to interpolate the temporary language model with the initial language model according to the perplexity to obtain the target language model.
The above apparatus is used to execute the method provided by the foregoing embodiments; its implementation principle and technical effects are similar and are not repeated here.
The above modules may be implemented as one or more integrated circuits configured to carry out the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of invoking program code. As yet another example, these modules may be integrated together and implemented in the form of a system-on-chip (SoC).
Optionally, the present application further provides a multi-dialect accent Mandarin speech recognition product that runs the multi-dialect accent Mandarin speech recognition model trained in the embodiments of the present application, so that the product has a speech recognition system with high recognition performance and strong robustness. The product can be applied to various products requiring voice interaction; for example, generating a meeting record for a user during a meeting, or recognizing the user's speech as text in scenarios such as voice input, voice search, and voice commands, and then executing the related operations according to the recognized text.
Optionally, the present application also provides a program product, for example a computer-readable storage medium, comprising a program which, when executed by a processor, performs the above method embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A multi-dialect accent Mandarin speech recognition model training method, characterized by comprising:
obtaining training samples, the training samples comprising: labeled standard Mandarin speech data, unlabeled dialect accent Mandarin speech data, and text data;
training with the labeled standard Mandarin speech data to obtain an initial acoustic model, and training with the text data to obtain an initial language model;
iteratively training the initial acoustic model based on the unlabeled dialect accent Mandarin speech data to obtain a target acoustic model, wherein the target acoustic model is used for recognizing a preset number of types of dialect accent Mandarin speech data, each type of dialect accent Mandarin speech data corresponding to one kind of dialect accent Mandarin;
training with the text to be trained, which is obtained through recognition by the target acoustic model and the initial language model, to obtain a temporary language model, and combining the temporary language model with the initial language model to obtain a target language model;
combining the target acoustic model and the target language model into a multi-dialect accent Mandarin speech recognition model.
2. The method of claim 1, wherein iteratively training the initial acoustic model based on the unlabeled dialect accent Mandarin speech data to obtain a target acoustic model comprises:
taking the initial acoustic model as an initial temporary acoustic model;
A. using the temporary acoustic model and the initial language model to perform recognition processing on the unlabeled dialect accent Mandarin speech data to obtain recognition texts;
B. obtaining a preset number of data sets according to the confidence, the language information, and the region label information of the recognition texts, wherein each data set comprises a plurality of recognition texts, and the recognition texts in the same data set correspond to the same type of dialect accent Mandarin;
C. using each data set to separately adjust the temporary acoustic model to obtain the preset number of dialect acoustic models;
D. screening at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model;
E. merging each candidate merging model with the temporary acoustic model to obtain a new temporary acoustic model;
and circularly executing the steps A-E until the accuracy of the temporary acoustic model meets a preset condition, and taking the temporary acoustic model as the target acoustic model.
3. The method of claim 2, wherein before using the temporary acoustic model and the initial language model to perform recognition processing on the unlabeled dialect accent Mandarin speech data, the method further comprises:
determining the language information of the unlabeled dialect accent Mandarin speech data by using a preset language recognition engine.
4. The method according to claim 3, wherein obtaining a preset number of data sets according to the confidence, the language information, and the region label information of the recognition texts comprises:
screening out, from the recognition texts, the recognition texts with a confidence greater than a preset threshold;
and dividing the recognition texts with a confidence greater than the preset threshold into the preset number of data sets according to the language information and the region label information.
5. The method of claim 2, wherein screening at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model comprises:
using each dialect acoustic model respectively to perform recognition processing on the labeled speech test set to obtain recognition results;
determining the accuracy of each dialect acoustic model according to the recognition results and the labeling information of the labeled speech test set;
and screening at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model.
6. The method according to any one of claims 1-5, wherein before training with the text to be trained, which is obtained through recognition by the target acoustic model and the initial language model, to obtain the target language model, the method further comprises:
using the target acoustic model and the initial language model to perform recognition processing on the labeled standard Mandarin speech data and the unlabeled dialect accent Mandarin speech data to obtain the text to be trained.
7. The method of claim 6, wherein combining the temporary language model with the initial language model to obtain a target language model comprises:
determining the perplexity of the labeled speech test set by using the temporary language model and the initial language model respectively;
and performing interpolation processing on the temporary language model and the initial language model according to the perplexity to obtain the target language model.
8. A multi-dialect accent Mandarin speech recognition model training apparatus, characterized in that the apparatus comprises: an acquisition module, a training module, and a combination module;
the acquisition module is configured to obtain a training sample, where the training sample includes: labeled standard Mandarin speech data, unlabeled dialect accent Mandarin speech data, and text data;
the training module is configured to train an initial acoustic model using the labeled standard Mandarin speech data and train an initial language model using the text data; iteratively train the initial acoustic model based on the unlabeled dialect accent Mandarin speech data to obtain a target acoustic model, wherein the target acoustic model is used for recognizing a preset number of types of dialect accent Mandarin speech data, each type of dialect accent Mandarin speech data corresponding to one kind of dialect accent Mandarin; and train the initial language model using the text to be trained, which is obtained through recognition by the target acoustic model and the initial language model, to obtain a target language model;
the combination module is configured to combine the target acoustic model and the target language model into a multi-dialect accent Mandarin speech recognition model.
9. The apparatus of claim 8, wherein the training module is further configured to:
taking the initial acoustic model as an initial temporary acoustic model;
A. using the temporary acoustic model and the initial language model to perform recognition processing on the unlabeled dialect accent Mandarin speech data to obtain recognition texts;
B. obtaining a preset number of data sets according to the confidence, the language information, and the region label information of the recognition texts, wherein each data set comprises a plurality of recognition texts, and the recognition texts in the same data set correspond to the same type of dialect accent Mandarin;
C. using each data set to separately adjust the temporary acoustic model to obtain the preset number of dialect acoustic models;
D. screening at least one candidate merging model from the dialect acoustic models according to the accuracy of each dialect acoustic model;
E. merging each candidate merging model with the temporary acoustic model to obtain a new temporary acoustic model;
and circularly executing the steps A-E until the accuracy of the temporary acoustic model meets a preset condition, and taking the temporary acoustic model as the target acoustic model.
10. A multi-dialect accent Mandarin speech recognition model training device, characterized by comprising: a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor; when the multi-dialect accent Mandarin speech recognition model training device operates, the processor communicates with the storage medium via the bus, and the processor executes the machine-readable instructions to perform the steps of the method according to any one of claims 1-7.
CN202011433866.3A 2020-12-10 2020-12-10 Method, device and equipment for training multi-dialect accent mandarin speech recognition model Active CN112233653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011433866.3A CN112233653B (en) 2020-12-10 2020-12-10 Method, device and equipment for training multi-dialect accent mandarin speech recognition model


Publications (2)

Publication Number Publication Date
CN112233653A 2021-01-15
CN112233653B CN112233653B (en) 2021-03-12

Family

ID=74124175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011433866.3A Active CN112233653B (en) 2020-12-10 2020-12-10 Method, device and equipment for training multi-dialect accent mandarin speech recognition model

Country Status (1)

Country Link
CN (1) CN112233653B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010107982A (en) * 2008-10-31 2010-05-13 Qinghua Univ Method and system for modeling common-language speech recognition in computer with background of a plurality of dialects
CN109887497A (en) * 2019-04-12 2019-06-14 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN111862942A (en) * 2020-07-28 2020-10-30 苏州思必驰信息科技有限公司 Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN111883110A (en) * 2020-07-30 2020-11-03 上海携旅信息技术有限公司 Acoustic model training method, system, device and medium for speech recognition

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112770154A (en) * 2021-01-19 2021-05-07 深圳西米通信有限公司 Intelligent set top box with voice interaction function and interaction method thereof
CN112885335A (en) * 2021-01-22 2021-06-01 北京读我科技有限公司 Speech recognition method and related device
CN112951206A (en) * 2021-02-08 2021-06-11 天津大学 Tibetan Tibet dialect spoken language identification method based on deep time delay neural network
CN112908296A (en) * 2021-02-18 2021-06-04 上海工程技术大学 Dialect identification method
CN113053358A (en) * 2021-02-26 2021-06-29 上海声通信息科技股份有限公司 Voice recognition customer service system for regional dialects
CN113053367B (en) * 2021-04-16 2023-10-10 北京百度网讯科技有限公司 Speech recognition method, speech recognition model training method and device
CN113053367A (en) * 2021-04-16 2021-06-29 北京百度网讯科技有限公司 Speech recognition method, model training method and device for speech recognition
CN113192491A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Acoustic model generation method and device, computer equipment and storage medium
CN113192492A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN113192492B (en) * 2021-04-28 2024-05-28 平安科技(深圳)有限公司 Speech recognition method, device, computer equipment and storage medium
CN113192491B (en) * 2021-04-28 2024-05-03 平安科技(深圳)有限公司 Acoustic model generation method, acoustic model generation device, computer equipment and storage medium
CN113421591A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Voice labeling method, device, equipment and storage medium
CN113781999B (en) * 2021-09-13 2024-02-20 中国科学院计算技术研究所 Dialect voice data segmentation and labeling method and system
CN113781999A (en) * 2021-09-13 2021-12-10 中国科学院计算技术研究所 Dialect voice data segmentation and labeling method and system
CN115512696A (en) * 2022-09-20 2022-12-23 中国第一汽车股份有限公司 Simulation training method and vehicle
CN115394288A (en) * 2022-10-28 2022-11-25 成都爱维译科技有限公司 Language identification method and system for civil aviation multi-language radio land-air conversation
CN115810345A (en) * 2022-11-23 2023-03-17 北京伽睿智能科技集团有限公司 Intelligent speech technology recommendation method, system, equipment and storage medium
CN115810345B (en) * 2022-11-23 2024-04-30 北京伽睿智能科技集团有限公司 Intelligent speaking recommendation method, system, equipment and storage medium
CN116050433A (en) * 2023-02-13 2023-05-02 北京百度网讯科技有限公司 Scene adaptation method, device, equipment and medium of natural language processing model
CN116050433B (en) * 2023-02-13 2024-03-26 北京百度网讯科技有限公司 Scene adaptation method, device, equipment and medium of natural language processing model

Also Published As

Publication number Publication date
CN112233653B (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112233653B (en) Method, device and equipment for training multi-dialect accent mandarin speech recognition model
CN110263322B (en) Audio corpus screening method and device for speech recognition and computer equipment
US8818813B2 (en) Methods and system for grammar fitness evaluation as speech recognition error predictor
US6718303B2 (en) Apparatus and method for automatically generating punctuation marks in continuous speech recognition
EP3349125B1 (en) Language model generation device, language model generation method, and recording medium
CN109979432B (en) Dialect translation method and device
CN111177324B (en) Method and device for carrying out intention classification based on voice recognition result
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
Sridhar et al. Combining lexical, syntactic and prosodic cues for improved online dialog act tagging
CN106847259B (en) Method for screening and optimizing audio keyword template
CN112466279B (en) Automatic correction method and device for spoken English pronunciation
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN112927679A (en) Method for adding punctuation marks in voice recognition and voice recognition device
Dufour et al. Characterizing and detecting spontaneous speech: Application to speaker role recognition
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
CN112562640A (en) Multi-language speech recognition method, device, system and computer readable storage medium
CN114373481A (en) Pronunciation error detection method and device and pronunciation error detection model training method and device
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
Kyriakopoulos et al. Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech
CN112069816A (en) Chinese punctuation adding method, system and equipment
Bakhturina et al. A toolbox for construction and analysis of speech datasets
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant