CN114078468B

CN114078468B - Voice multi-language recognition method, device, terminal and storage medium

Info

Publication number: CN114078468B
Application number: CN202210058785.2A
Authority: CN
Inventors: 张辽
Original assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Current assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date: 2022-01-19
Filing date: 2022-01-19
Publication date: 2022-05-13
Anticipated expiration: 2042-01-19
Also published as: CN114078468A; WO2023138286A1

Abstract

The application provides a multi-language speech recognition method, device, terminal and storage medium. The method comprises: acquiring voice data to be recognized and a multilingual acoustic model, the multilingual acoustic model being obtained by fusing shared hidden layers of a plurality of mixed bilingual models; obtaining a confidence for each language according to the voice data to be recognized and the multilingual acoustic model; and determining the language corresponding to the voice data to be recognized based on the confidence for each language. Because the multilingual acoustic model shares hidden layers across the mixed bilingual models, the amount of computation is reduced compared with traditional multilingual recognition models, language identification becomes more efficient, and the user experience is improved.

Description

Voice multi-language recognition method, device, terminal and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method for recognizing multiple languages in a speech, a corresponding device for recognizing multiple languages in a speech, a corresponding vehicle-mounted terminal, and a computer-readable storage medium.

Background

As artificial intelligence technology matures, more and more intelligent devices enter users' lives, and human-machine interaction is becoming common. Voice input is a natural and convenient mode of human-machine interaction that frees the user's hands; most current smart devices provide a speech recognition function, which improves convenience for the user. At present, the voice data to be recognized may be speech in a single language, mixed speech in two languages, or mixed speech in multiple languages. Existing mixed multi-language recognition models are mainly built by modeling an acoustic model for each group of mixed bilingual speech, such as English-German and English-French, and identifying the language from the scores output by the multiple groups of acoustic models.

Disclosure of Invention

In view of the above, the present application is proposed to provide a method for multi-lingual speech recognition, a corresponding apparatus for multi-lingual speech recognition, a corresponding vehicle terminal and a computer-readable storage medium that overcome or at least partially address the above problems.

The application discloses a multi-language voice recognition method, which comprises the following steps:

acquiring voice data to be recognized and a multilingual acoustic model; the multilingual acoustic model is obtained based on fusion of shared hidden layers of a plurality of mixed bilingual models;

obtaining a confidence for each language according to the voice data to be recognized and the multilingual acoustic model;

and determining the language corresponding to the voice data to be recognized based on the confidence for each language.

In the multilingual speech recognition method, the method further comprises:

and decoding the voice data to be recognized, and displaying the decoded voice data in real time.

In the multi-language recognition method of the voice, the hidden layers of the mixed bilingual models comprise a bottom hidden layer and a high hidden layer which are distinguished according to a preset proportion, and the bottom hidden layer is used for combining and generating a shared hidden layer;

the obtaining confidence degrees aiming at various languages according to the voice data to be recognized and the multilingual acoustic model comprises the following steps:

inputting voice data to be recognized into a shared hidden layer of a plurality of mixed bilingual models to obtain a first output result;

inputting the first output results into the high-level hidden layers of the mixed bilingual models respectively to obtain a plurality of second output results;

and merging the second output results as the input of a preset language classification model to obtain a confidence for each language.

In the multi-language speech recognition method, merging the second output results as the input of the preset language classification model to obtain a confidence for each language includes:

splicing the multi-dimensional feature vectors that represent the second output results along the corresponding dimension, and using the spliced feature vector as the input of the preset language classification model to obtain a plurality of confidences for the different languages.
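The data flow described above (shared hidden layer, per-model high-level hidden layers, splicing, language classification) can be sketched in a few lines. This is a minimal illustration only: plain tanh layers stand in for the LSTM layers, the weights are random, and every name (`dense`, `language_confidences`, the `en-de` and `en-fr` keys) is an assumption introduced here, not something defined by the application.

```python
import math
import random

def dense(weights, x):
    """Toy hidden layer, tanh of a matrix-vector product (stands in for an LSTM layer)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(rng, n_out, n_in):
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def language_confidences(frame, shared, high_layers, classifier):
    # First output result: the frame passed through the shared hidden layer.
    first = dense(shared, frame)
    # Second output results: one per mixed bilingual model's high-level layer.
    second = {pair: dense(w, first) for pair, w in high_layers.items()}
    # Splice the second output results into one feature vector (fixed order).
    spliced = [v for pair in sorted(second) for v in second[pair]]
    # Language classification model: linear scores followed by softmax.
    langs = sorted(classifier)
    scores = [sum(w * v for w, v in zip(classifier[lang], spliced)) for lang in langs]
    return dict(zip(langs, softmax(scores))), second

rng = random.Random(0)
FEAT, HID = 4, 3  # hypothetical feature and hidden sizes
shared = random_layer(rng, HID, FEAT)
high_layers = {"en-de": random_layer(rng, HID, HID),
               "en-fr": random_layer(rng, HID, HID)}
classifier = {lang: [rng.uniform(-1, 1) for _ in range(2 * HID)]
              for lang in ("de", "en", "fr")}
confidences, cached = language_confidences(
    [0.1, -0.2, 0.3, 0.05], shared, high_layers, classifier)
```

The function also returns the second output results so that a caller could keep them for later decoding instead of recomputing them.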

In the method for recognizing multiple languages of speech, the determining the language corresponding to the speech data to be recognized based on the confidence degrees for the languages includes:

if exactly one confidence is greater than a preset value, determining the language corresponding to that confidence as the language corresponding to the voice data to be recognized;

or, if two or more confidences are greater than the preset value, determining the language corresponding to the maximum confidence as the language corresponding to the voice data to be recognized;

or, if none of the confidences reaches the preset value, determining the language corresponding to the maximum confidence as the language corresponding to the voice data to be recognized.
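The three cases can be written out directly. The sketch below is illustrative only; the function name `pick_language`, the threshold of 0.5 and the sample confidences are assumptions, not values from the application.

```python
def pick_language(confidences, threshold):
    """Return (language, case) following the three cases listed in the text."""
    above = [lang for lang, c in confidences.items() if c > threshold]
    best = max(confidences, key=confidences.get)
    if len(above) == 1:
        return above[0], "exactly one confidence above the preset value"
    if len(above) >= 2:
        return best, "two or more above the preset value, maximum taken"
    return best, "none above the preset value, maximum taken"

example = {"en": 0.15, "de": 0.70, "fr": 0.15}  # hypothetical confidences
language, case = pick_language(example, threshold=0.5)
```

As written, each of the three cases ends up selecting the language with the highest confidence; the preset value only distinguishes which case applied.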

In the method for multi-language recognition of speech, the displaying the decoded speech data in real time includes:

before the language corresponding to the voice data to be recognized is determined, decoding the voice data to be recognized and displaying the decoded voice data in a preset language;

after the language corresponding to the voice data to be recognized is determined, decoding the voice data to be recognized with the mixed bilingual model corresponding to the determined language, and replacing the displayed content with the voice data decoded in the determined language as display continues.
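The two display phases can be pictured as a small streaming loop. This is a sketch under stated assumptions: the decoders and the language detector are passed in as hypothetical callables, and the toy stand-ins below only concatenate frame labels to make the switch-over visible.

```python
def on_screen_updates(frames, decode_preset, decode_by_language, detect_language):
    """Yield the text shown on screen after each frame (all callables are hypothetical)."""
    language = None
    seen = []
    for frame in frames:
        seen.append(frame)
        if language is None:
            language = detect_language(seen)
        if language is None:
            # Language not yet determined: show text decoded in the preset language.
            yield decode_preset(seen)
        else:
            # Language determined: replace the display with the matching model's decoding.
            yield decode_by_language[language](seen)

# Toy stand-ins for illustration only.
updates = list(on_screen_updates(
    ["f1", "f2", "f3"],
    decode_preset=lambda fs: "en:" + "".join(fs),
    decode_by_language={"de": lambda fs: "de:" + "".join(fs)},
    detect_language=lambda fs: "de" if len(fs) >= 2 else None,
))
```

Here the first update is shown in the preset language, and from the second frame onward the whole displayed text is replaced by the decoding in the determined language.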

In the method for recognizing multiple languages of speech, the acoustic model of multiple languages includes a preset language model, and before determining the language corresponding to the speech data to be recognized, parsing the speech data to be recognized and displaying the parsed speech data in the preset language includes:

inputting the first output result into a hidden layer of a preset language model to obtain a third output result; wherein the preset language model is positioned on an output layer of the shared hidden layer;

and before determining the language corresponding to the voice data to be recognized, decoding the third output result to obtain recognized voice information, and displaying the recognized voice information in a preset language.

In the method for multi-language recognition of speech, the high-level hidden layers of the mixed bilingual models respectively and independently form a mixed output layer, and after the language corresponding to the speech data to be recognized is determined, the mixed bilingual model corresponding to the determined language is adopted to analyze the speech data to be recognized, and the analyzed speech data is continuously displayed in the determined language, which includes:

and after determining the language corresponding to the voice data to be recognized, replacing and displaying the displayed voice information in the determined language by adopting the mixed bilingual model corresponding to the language of the voice data to be recognized.

In the multilingual speech recognition method, the multilingual acoustic model is established based on a neural network, and the method further comprises the following steps:

dividing hidden layers of a plurality of mixed bilingual models into a bottom hidden layer and a high hidden layer according to a preset proportion, and combining the bottom hidden layer to generate a shared hidden layer, wherein the plurality of mixed bilingual models are neural networks comprising the plurality of hidden layers, and the high hidden layers of the plurality of mixed bilingual models have language features corresponding to the mixed bilingual models;

adding a hidden layer of a preset language model on an output layer of the shared hidden layer;

combining a plurality of output layers of the high-level hidden layer with language characteristics corresponding to the mixed bilingual models to serve as input layers of a preset language classification model to construct a preset language classification model, and independently constructing the mixed output layers by the high-level hidden layers of the mixed bilingual models;

and generating a multilingual acoustic model by adopting the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layer, the preset language classification model and the mixed output layer.
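The five components named in the construction steps can be gathered in one container, which makes their relationships easier to see. The class, the `assemble` helper and all layer names below are placeholders invented for illustration; the application does not prescribe any particular data structure.

```python
from dataclasses import dataclass

@dataclass
class MultilingualAcousticModel:
    shared_hidden: list           # merged bottom hidden layers
    preset_language_hidden: list  # hidden layer(s) of the preset language model
    high_hidden: dict             # language pair -> retained high-level hidden layers
    classifier_inputs: list       # pairs whose high-level outputs feed the classifier
    mixed_output: dict            # language pair -> its own mixed output layer

def assemble(shared, preset_hidden, high_by_pair, output_by_pair):
    # Every mixed bilingual model contributes high-level layers and an output layer.
    if set(high_by_pair) != set(output_by_pair):
        raise ValueError("each mixed bilingual model needs both parts")
    return MultilingualAcousticModel(
        shared_hidden=list(shared),
        preset_language_hidden=list(preset_hidden),
        high_hidden=dict(high_by_pair),
        classifier_inputs=sorted(high_by_pair),
        mixed_output=dict(output_by_pair),
    )

model = assemble(
    shared=["shared_1", "shared_2"],
    preset_hidden=["en_hidden"],
    high_by_pair={"en-de": ["de_high"], "en-fr": ["fr_high"]},
    output_by_pair={"en-de": "de_softmax", "en-fr": "fr_softmax"},
)
```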

The application also discloses a multi-language speech recognition device, the device comprising:

the multilingual acoustic model acquisition module is used for acquiring the voice data to be recognized and the multilingual acoustic model; the multilingual acoustic model is obtained based on fusion of shared hidden layers of a plurality of mixed bilingual models;

the confidence coefficient generation module is used for obtaining confidence coefficients aiming at all languages according to the voice data to be recognized and the multilingual acoustic model;

and the language identification module is used for determining the language corresponding to the voice data to be identified based on the confidence coefficient aiming at each language.

In the multi-language speech recognition device, the device further comprises an on-screen display module for decoding the voice data to be recognized and displaying the decoded voice data in real time. The on-screen display module is specifically configured to: before the language corresponding to the voice data to be recognized is determined, decode the voice data to be recognized and display the decoded voice data in a preset language; and after the language is determined, decode the voice data to be recognized with the mixed bilingual model corresponding to the determined language, and replace the displayed content with the voice data decoded in the determined language as display continues.

In the multi-language recognition device of the voice, the hidden layers of the mixed bilingual models comprise a bottom hidden layer and a high hidden layer which are distinguished according to a preset proportion, and the bottom hidden layer is used for combining and generating a shared hidden layer; the confidence coefficient generation module is specifically used for inputting the voice data to be recognized into a shared hidden layer of a plurality of mixed bilingual models to obtain a first output result; inputting the first output results into the high-level hidden layers of the mixed bilingual models respectively to obtain a plurality of second output results; and combining the second output results to be used as an input item of a preset language classification model to obtain a plurality of confidence coefficients aiming at each language.

The confidence generation module is specifically configured to splice the multi-dimensional feature vectors that represent the second output results along the corresponding dimension, and to use the spliced feature vector as the input of the preset language classification model to obtain a plurality of confidences for the different languages. The language identification module is specifically configured to: when one and only one confidence is greater than a preset value, determine the language corresponding to that confidence as the language corresponding to the voice data to be recognized; or, when two or more confidences are greater than the preset value, determine the language corresponding to the maximum confidence as the language corresponding to the voice data to be recognized; or, when none of the confidences reaches the preset value, determine the language corresponding to the maximum confidence as the language of the voice data to be recognized.

In a multilingual speech recognition apparatus, the multilingual acoustic model is created based on a neural network, the apparatus further comprising: and the multilingual acoustic model generating module is used for generating the multilingual acoustic model based on the fusion of the shared hidden layers of the multiple mixed bilingual models.

The multilingual acoustic model generation module is specifically configured to: divide the hidden layers of a plurality of mixed bilingual models into bottom hidden layers and high-level hidden layers according to a preset proportion, and merge the bottom hidden layers to generate a shared hidden layer, wherein the mixed bilingual models are neural networks comprising a plurality of hidden layers, and the high-level hidden layers of the mixed bilingual models carry the language features of their respective models; add a hidden layer of a preset language model on the output layer of the shared hidden layer; merge the output layers of the high-level hidden layers carrying the language features of the respective mixed bilingual models to serve as the input layer of a preset language classification model, thereby constructing the preset language classification model, and form an independent mixed output layer on the high-level hidden layer of each mixed bilingual model; and generate the multilingual acoustic model from the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layers, the preset language classification model and the mixed output layers.

The application also discloses a vehicle-mounted terminal, comprising: the multi-language speech recognition device, a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of any of the multi-language speech recognition methods described above.

The present application further discloses a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the language recognition methods based on the multilingual acoustic model or the steps of any of the multi-language speech recognition methods.

The application includes the following advantages:

In the present application, a multilingual acoustic model generated by fusing the shared hidden layers of a plurality of mixed bilingual models is used to process the voice data to be recognized and obtain a confidence for each language; the language corresponding to the voice data to be recognized is then determined from the obtained confidences, completing multi-language recognition of the speech. Because hidden layers are shared within the model, the amount of computation is reduced compared with traditional multilingual recognition models, language identification becomes more efficient, and the user experience is improved.

Drawings

FIG. 1 is a diagram illustrating a model of a multilingual acoustic model in the related art;

FIG. 2 is a flow chart illustrating the steps of the method for multi-lingual speech recognition provided by the present application;

FIG. 3 is a schematic diagram of a multilingual acoustic model provided herein;

FIG. 4 is a schematic diagram of an application of the multilingual acoustic model provided in the present application;

FIG. 5 is a block diagram of the multi-language speech recognition device provided by the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

The voice data to be recognized may be speech in a single language, mixed speech in two languages, or mixed speech in multiple languages. Consider, for example, promoting a product across a wide region such as the global, Asian or European market. A given region may contain many languages, for example more than 20, and the language families of these languages can differ greatly, so unified modeling for language identification is difficult. In addition, the countries in such a region may be small in area and exchange with one another frequently, so that, besides the native language, a local user also needs recognition of POIs (Points of Interest) and command words of other major countries and regions. Taking modeling cost and user experience into account, the multi-mixed-language recognition system to be built must occupy few resources and recognize quickly.

At present, multi-mixed-language recognition models are constructed as shown in FIG. 1, a model diagram of a multilingual acoustic model in the related art. To recognize the language of a user's speech in a certain region, acoustic modeling cannot be performed for all the different languages at once. Assuming that the preset language widely used in the region is English, more than 20 groups of mixed bilingual models are therefore usually built, such as English-German and English-French. The neural network layers of the English-German model, for example N LSTM (Long Short-Term Memory) hidden layers, output English factor feature vectors and German factor feature vectors, on which the model's mixed output layer performs softmax score calculation; likewise, the neural network layers of the English-French model, for example M LSTM hidden layers, output English factor feature vectors and French feature vectors for softmax score calculation by its mixed output layer. Because English is widely used, mixed bilingual modeling of the place names, person names, names of special organizations and the like of each country in the region is performed mainly in the corresponding languages, while general command words can additionally be modeled in the more widely used language, such as English. This ensures that an instruction can still be completed in English when the other language is recognized inaccurately, which provides a fallback. When the multilingual acoustic model shown in FIG. 1 is used, after the mixed bilingual models have been obtained by modeling each group of mixed bilingual acoustic models, such as English-German and English-French, the language corresponding to the voice data to be recognized may be determined from the language scores that the voice data obtains in each of the mixed bilingual models.

However, this language identification approach based on the output scores of multiple groups of acoustic models requires modeling more than 20 groups of mixed bilingual models, such as English-German and English-French, which consumes a large amount of memory and places high demands on the machine on which the models are deployed. When more than 20 groups of mixed bilingual acoustic models are each used to compute a language score for the user's voice request, the amount of computation during use is large, so the acoustic models must be made smaller while a more powerful CPU (Central Processing Unit) is used. Reducing the size of an acoustic model means reducing the dimension of its feature vectors and the number of its neural network layers; with less feature screening by the neural network, the recognition performance of the model deteriorates. Moreover, the language scores of the multiple groups must be compared against each other (score PK). For the user, the on-screen result changes frequently, and the time consumed by the score comparison increases the on-screen delay. The requirements of low resource occupation and high recognition speed therefore cannot be met, which affects the user's on-screen experience.

Referring to fig. 2, a flow chart illustrating steps of the speech multi-language recognition method provided by the present application is shown, which may specifically include the following steps:

step 201, acquiring voice data to be recognized and a multilingual acoustic model, wherein the multilingual acoustic model is obtained based on fusion of shared hidden layers of a plurality of mixed bilingual models;

In the present application, the multilingual acoustic model obtained by fusing the shared hidden layers of multiple mixed bilingual models is used to identify the language of the speech. Because hidden layers are shared within the model, the amount of computation is reduced compared with traditional multilingual recognition models, and language identification becomes more efficient.

The multilingual acoustic model used by the present application does not require multiple groups of complete mixed bilingual models to take part in the computation and recognition process. The multilingual acoustic model may be constructed, before it is acquired, by fusing the shared hidden layers of the multiple mixed bilingual models.

Specifically, one core idea of the multilingual acoustic model constructed by the present application is to merge the bottom hidden layers of the multiple groups of mixed bilingual models into a shared hidden layer, which reduces the memory consumed by the constructed model. In addition, when the voice data to be recognized is processed, a language classification step based on a preset language classification model is added, and the output results of each high-level hidden layer are cached until the language is determined. When the voice data is then displayed on screen, the subsequent display can be produced by the mixed bilingual model of the determined language alone, which reduces the amount of computation of the multilingual acoustic model.

In practical application, the multilingual acoustic model can be generated by adopting a shared hidden layer, a hidden layer of a preset language model, a high-level hidden layer, a preset language classification model and a constructed mixed output layer.

Regarding the construction of the shared hidden layer: in existing multilingual acoustic models, each group of mixed bilingual acoustic models, such as English-German and English-French, is modeled independently, and each mixed bilingual model is a neural network with multiple hidden layers. Among the hidden layers of a mixed bilingual model, for example N layers, some have parameters in common with the other mixed bilingual models. Each of these layers extracts feature dimensions that are common across languages, such as pauses and syllable length. The other hidden layers of the mixed bilingual model capture features specific to its own languages. The hidden layers with parameter commonality are called bottom hidden layers, and the hidden layers with distinct language features are called high-level hidden layers. It should be noted that each group of mixed bilingual speech may be formed by mixing the preset language widely used in the region with another language, and is not limited to acoustic models such as English-German and English-French.

Referring to FIG. 3, a model diagram of the multilingual acoustic model provided in the present application is shown. To reduce the memory consumption and the amount of computation of the constructed multilingual acoustic model, the hidden layers of the multiple groups of mixed bilingual models, such as the English-German and English-French models, may be divided according to whether they carry distinct language features. The division follows a preset proportion, for example 80% bottom hidden layers and 20% high-level hidden layers in each mixed bilingual model. The bottom hidden layers are merged into a shared hidden layer, and the high-level hidden layers with distinct language features are retained. Introducing the shared hidden layer formed from the bottom hidden layers improves how well the constructed multilingual acoustic model fits the hardware of the device.
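The split by a preset proportion can be illustrated with a short helper. This sketch only covers where the cut falls; the function name, the rounding rule and the layer labels are assumptions made here, and how the bottom layers of different models are actually merged into one shared stack is not specified in this passage.

```python
def split_hidden_layers(layers, bottom_ratio=0.8):
    """Split an ordered list of at least two hidden layers into (bottom, high) parts."""
    if len(layers) < 2:
        raise ValueError("need at least two hidden layers to split")
    if not 0.0 < bottom_ratio < 1.0:
        raise ValueError("bottom_ratio must be between 0 and 1")
    cut = round(len(layers) * bottom_ratio)
    # Keep at least one layer on each side of the cut.
    cut = min(max(cut, 1), len(layers) - 1)
    return layers[:cut], layers[cut:]

# Hypothetical five-layer mixed bilingual model, split 80% / 20%.
bottom, high = split_hidden_layers([f"lstm_{i}" for i in range(1, 6)])
```

With five layers and the example proportion of 80%, the first four layers fall into the bottom part and the last layer is retained as the high-level part.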

In the constructed multilingual acoustic model, the output result of the merged shared hidden layer serves as the input of each retained high-level hidden layer, so that the outputs of the high-level hidden layers can later be displayed in the corresponding language through the mixed output layer of the determined language. Retaining the high-level hidden layers preserves the modeling accuracy of the multilingual acoustic model while memory consumption and computation are reduced.

In the constructed multilingual acoustic model, while the output result of the merged shared hidden layer serves as the input of the high-level hidden layers, a hidden layer of a preset language model can also be added on the output layer of the shared hidden layer, so that the output result of the shared hidden layer also serves as the input of the preset language model. The voice data can then be displayed on screen in the preset language before the language is determined, which reduces on-screen delay, avoids frequent language changes in the on-screen result, and improves the user's on-screen experience. The preset language may be a language widely used in a certain region. For example, when recognizing the speech language of users in a European country, English is widely used in Europe, so a hidden layer of a preset English model may be added on the output layer of the shared hidden layer. The constructed multilingual acoustic model can then display results on screen in English and also provides a fallback.

The language of the speech can be determined by introducing a preset language classification model. The high-level hidden layers of the mixed bilingual models carry the language features of their respective models. Specifically, the output layers of these high-level hidden layers can be merged to serve as the input layer of the preset language classification model, thereby constructing that model. When the language of the voice data is to be determined, the preset language classification model outputs a confidence for each language class, from which the corresponding language is determined.

The language features used to train the language classification model are mainly high-level abstract features. As shown in FIG. 3, these features are output by the bottom hidden layers and high-level hidden layers of the multiple mixed bilingual models. Because the neural network of the classification model can use these high-level abstract features directly, it needs no preceding hidden layers of its own for feature extraction. Building the language classification model on high-level abstract features therefore reduces its number of neural network layers while still supplying the feature dimensions it requires. The small number of layers keeps the delay and computation of the language classification model low, and splicing the cached high-level abstract features keeps its recognition performance high.

Regarding the mixed output layers of the multilingual acoustic model: the independently formed mixed output layers are used to decode the speech in the voice data to be recognized and display it on screen in the corresponding language. To ensure that the language is determined accurately and to reduce the amount of computation of the multilingual acoustic model, the output results of the high-level hidden layers of the multiple mixed bilingual models may first be cached. After the language of the voice data has been determined by the preset language classification model, the cached output results of the high-level hidden layer are input into the mixed output layer of the corresponding mixed bilingual model for processing. In other words, the mixed output layers independently formed on the high-level hidden layers of the mixed bilingual models can be set to perform softmax calculation only after the language has been determined by the preset language classification model.

Caching the output results of the high-level hidden layers of the multiple mixed bilingual models ensures that no softmax calculation is performed on these outputs before the language is determined. Softmax is a function commonly used in machine learning that computes, for a group of values, the proportion of each value (after exponentiation) in the whole. The degree of similarity of each word in the voice data is determined from the calculated proportions, and the words to be displayed on screen are selected accordingly.

As shown in FIG. 3, the hidden layer output before the final softmax layer (that is, the output result of the high-level hidden layer, before the constructed mixed output layer) can be cached in each mixed bilingual model. Without this caching, softmax would have to be calculated by the mixed output layer of every mixed bilingual model. To reduce the amount of computation, the calculation of the mixed output layer of each mixed bilingual model is suspended, that is, no softmax calculation is performed before the language is determined. Once the language is determined, softmax calculation is resumed, but only through the mixed output layer of the mixed bilingual model corresponding to the determined language.
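The caching and deferred softmax can be sketched as follows. This is a simplified illustration with assumed names (`DeferredOutput`, `cache_frame`, `decode`) and toy weights; a real mixed output layer would be far larger, and the class is not taken from the application.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

class DeferredOutput:
    """Caches high-level hidden outputs; runs one output layer once the language is known."""

    def __init__(self, output_weights):
        self.output_weights = output_weights  # pair -> weight rows of its output layer
        self.cache = {pair: [] for pair in output_weights}
        self.softmax_calls = 0

    def cache_frame(self, high_outputs):
        # Before the language is determined: store the vectors, compute no softmax.
        for pair, vector in high_outputs.items():
            self.cache[pair].append(vector)

    def decode(self, pair):
        # After the language is determined: softmax only for the chosen model.
        posteriors = []
        for vector in self.cache[pair]:
            logits = [sum(w * v for w, v in zip(row, vector))
                      for row in self.output_weights[pair]]
            posteriors.append(softmax(logits))
            self.softmax_calls += 1
        return posteriors

deferred = DeferredOutput({
    "en-de": [[1.0, 0.0], [0.0, 1.0]],
    "en-fr": [[0.5, 0.5], [0.2, -0.2]],
})
deferred.cache_frame({"en-de": [0.3, 0.1], "en-fr": [0.2, 0.4]})
deferred.cache_frame({"en-de": [0.0, 0.9], "en-fr": [0.1, 0.1]})
posteriors = deferred.decode("en-de")
```

Two frames are cached for both models, but softmax runs only twice, once per frame for the selected model, rather than once per frame per model.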

It should be noted that the multiple mixed bilingual models are neural networks comprising multiple hidden layers and adopt an LSTM structure, so the multilingual acoustic model constructed from the shared hidden layer and the high-level hidden layers of the mixed bilingual models is also an LSTM structure. In an LSTM structure, the hidden layer dimension for each frame of the voice data to be recognized does not grow over time, that is, the hidden layer dimension is fixed. When the voice data to be recognized has 20 frames and the hidden layer dimension of the constructed multilingual acoustic model is 512, the memory occupied by the multilingual acoustic model during language recognition and on-screen display of the voice data can be 20 × 4 byte = 0.78 MB, so the constructed multilingual acoustic model is suitable for cloud storage.
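The memory figure can be checked with a short calculation. The product as printed does not by itself give 0.78 MB; the sketch below reproduces that figure only under the assumption, not stated in the text, that the outputs of 20 high-level branches are cached, each holding 20 frames of 512 four-byte values.

```python
FRAMES = 20               # frames of voice data (from the text)
HIDDEN_DIM = 512          # hidden layer dimension (from the text)
BYTES_PER_VALUE = 4       # 4-byte values (from the text)
NUM_CACHED_BRANCHES = 20  # ASSUMPTION: number of cached high-level branches

per_branch_bytes = FRAMES * HIDDEN_DIM * BYTES_PER_VALUE
total_bytes = NUM_CACHED_BRANCHES * per_branch_bytes
total_mb = total_bytes / (1024 * 1024)
```

Under that assumption one branch needs 40,960 bytes and all branches together need 819,200 bytes, which is 0.78125 MB.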

Step 202, obtaining confidence coefficients aiming at various languages according to voice data to be recognized and a multilingual acoustic model;

After the multilingual acoustic model has been obtained, namely the neural network built from the multilayer hidden layers of the multiple mixed bilingual models by fusing their shared hidden layers, the model can be used to recognize the voice data to be recognized so as to obtain confidences for each language. The language corresponding to the voice data to be recognized can then be determined based on the obtained confidences.

Specifically, the hidden layers of the multiple mixed bilingual models are divided into bottom hidden layers and high-level hidden layers according to a preset proportion, and the bottom hidden layers are merged to generate the shared hidden layer used for constructing the multilingual acoustic model. The voice data to be recognized first passes through the shared hidden layer of the multiple mixed bilingual models, which yields a first output result without obvious language features. The first output result is then input into each of the high-level hidden layers of the multiple mixed bilingual models, yielding second output results that each carry obvious language features. The second output results can then be used as the input of the preset language classification model to obtain multiple confidences for each language.
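The data flow described above can be sketched as follows. The class, the branch names and the toy "layers" (simple scalings standing in for LSTM hidden layers) are hypothetical; the point is only that the shared hidden layer is evaluated once and its first output result is reused by every high-level branch.

```python
def scale_layer(factor):
    """Toy stand-in for a hidden layer: multiplies every feature by a constant."""
    return lambda vec: [factor * x for x in vec]

class MultilingualFrontEnd:
    def __init__(self, shared_layers, high_level_branches):
        self.shared_layers = shared_layers              # merged bottom hidden layers
        self.high_level_branches = high_level_branches  # branch name -> list of layers
        self.shared_passes = 0

    def forward(self, features):
        # First output result: computed once by the shared hidden layer.
        first = features
        for layer in self.shared_layers:
            first = layer(first)
        self.shared_passes += 1
        # Second output results: one per mixed bilingual model.
        second = {}
        for name, layers in self.high_level_branches.items():
            out = first
            for layer in layers:
                out = layer(out)
            second[name] = out
        return first, second

model = MultilingualFrontEnd(
    shared_layers=[scale_layer(2.0)],
    high_level_branches={"en-de": [scale_layer(3.0)], "en-fr": [scale_layer(0.5)]},
)
first, second = model.forward([1.0, 2.0])
```

Here `second` plays the role of the second output results that are later cached and passed to the language classification model.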

The voice data to be recognized is input into the shared hidden layer of the multiple mixed bilingual models to obtain the first output result. The shared hidden layer can refer to hidden layers whose parameters are common to the different mixed bilingual models, for example hidden layers that capture characteristics found across different languages, such as pauses and syllable length. The first output result therefore has no obvious language features and cannot yet be used for language identification.

The high-level hidden layer can refer to a hidden layer with obvious language features in the multiple mixed bilingual models. The multiple second output results output by the high-level hidden layers each carry the language features of a specific language, so these output results can be used for language identification. At this point, the output results of the high-level hidden layers of the multiple mixed bilingual models, namely the multiple second output results, can be temporarily cached, so that no softmax calculation is performed on them before the language is determined. That is, the calculation of the mixed output layer of each mixed bilingual model is suspended until the language is determined. When the softmax calculation of the mixed output layer is started after the language has been determined, the voice data to be recognized only passes through the mixed output layer of the mixed bilingual model corresponding to the determined language, which achieves the purpose of reducing the calculation amount of the model.

The cached second output results undergo softmax calculation only once the recognized language has been determined. Until then, softmax calculation can be performed based on the introduced preset language model, for example the hidden layer of an English model, so that on-screen display remains possible during the period in which the mixed output layers perform no softmax. Specifically, before the language classification result is determined, the softmax calculation of the English model is performed and its result is used to display the voice data to be recognized on screen. This reduces the time the user waits for on-screen text while the language classification result is being determined, and avoids the impact on user experience that would arise if real-time on-screen display were impossible because no softmax was performed before the language classification result was determined.

To determine the language of the voice data to be recognized, a preset language classification model can be introduced. Because the output results of the high-level hidden layers carry obvious features specific to particular languages, the outputs of the high-level hidden layers of the multiple mixed bilingual models can be used as the input layer for training and constructing the preset language classification model. When the language of the voice data is subsequently determined, the corresponding language can be determined from the confidence that the preset language classification model outputs for each language class.

Specifically, the high-level hidden layer of each mixed bilingual model has the language features corresponding to that mixed bilingual model, so its output result has a distinct language character. The multidimensional feature vectors representing the second output results can be spliced according to their corresponding dimensions; as shown in fig. 3, language feature vectors such as the German hidden features and the French hidden features are spliced together. The spliced language features are used as the input layer of the preset language classification model. The constructed preset language classification model can have M convolution layers, and the convolution layers are used for the language softmax score calculation that yields the confidence for each language. The spliced language features are high-level abstract features. Based on the language differences among the spliced features, multiple confidences for different languages can be output within a short time, without the complete voice request audio having to be recognized. The output confidences can be used for real-time language judgment, which guarantees both the real-time performance and the language classification accuracy of the voice recognition system when deciding among the mixed models.
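The splicing and scoring step can be illustrated with a small sketch. The feature values, weights and function names are hypothetical, and a single linear layer followed by softmax is used here purely as a stand-in for the M convolution layers of the language classification model.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def splice(second_outputs, order):
    """Concatenate the per-branch feature vectors in a fixed order."""
    spliced = []
    for name in order:
        spliced.extend(second_outputs[name])
    return spliced

def language_confidences(spliced, weights, languages):
    """Stand-in classifier: one linear layer, then softmax over languages."""
    logits = [sum(w * x for w, x in zip(row, spliced)) for row in weights]
    return dict(zip(languages, softmax(logits)))

# Hypothetical hidden features from a German and a French branch.
second_outputs = {"de": [1.0, 0.0], "fr": [0.0, 0.2]}
spliced = splice(second_outputs, order=["de", "fr"])

# Hypothetical weights: one row per language over the spliced vector.
weights = [[2.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 2.0]]
confidences = language_confidences(spliced, weights, ["de", "fr"])
```

The resulting dictionary holds one confidence per language, which is the form consumed by the decision step in step 203.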

It should be noted that the language features used for training the language classification model are mainly high-level abstract features. As shown in fig. 3, these features are the outputs obtained through the bottom hidden layers and the high-level hidden layers of the multiple mixed bilingual models, and they are already available in the calculation cache. Because the neural network of the language classification model can use these high-level abstract features directly, it needs no front-end hidden layers of its own for feature extraction. When the language classification model is constructed, the number of neural network layers can therefore be reduced while the required feature dimensions are still provided. The small number of layers keeps the delay and calculation amount of the language classification model low, and the splicing of high-level abstract features ensures a good recognition effect.

Step 203, determining the language corresponding to the voice data to be recognized based on the confidence for each language.

In practical application, the voice data to be recognized can be decoded in real time, and the words of a preset number of consecutive frames obtained by real-time decoding are input into the language classification model to determine the language result of the real-time speech segment. Specifically, under a language classification decision design oriented to user experience, the words of the consecutive preset-length frames obtained by real-time decoding are input into the language classification model, the confidence of these words for each language is obtained through the language softmax calculation of the language classification model, and the language corresponding to the voice data to be recognized is determined based on the confidences.

The confidence for each language represents the likelihood that the voice data to be recognized belongs to that language. When the language of the voice data is determined, the language corresponding to the voice data to be recognized can therefore be determined from the per-language confidences output by the preset language classification model. Specifically, the real-time language result for the input words can be determined by comparing the confidences with a preset value.

For each word in the input consecutive preset-length frames, regardless of whether decoding of the voice data to be recognized has timed out (i.e., whether the number of words has been exceeded), there are three cases. In the first case, if exactly one confidence is greater than the preset value, that is, the confidence for one language exceeds the confidence threshold of that language, the language corresponding to that confidence is determined to be the language corresponding to the voice data to be recognized. In the second case, if two or more confidences are greater than the preset value, the language with the maximum confidence is determined to be the language corresponding to the voice data to be recognized. In the third case, if none of the confidences reaches the preset value, the language with the highest confidence may be taken as the language of the voice data to be recognized. The preset value may be a confidence threshold for each language, which is not limited in the embodiments of the present invention.
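The three cases can be written down as a short decision function. This is a sketch only: the function name, the case labels and the use of one shared preset value (rather than a separate threshold per language) are assumptions made for illustration.

```python
def decide_language(confidences, preset_value):
    """Return (language, case label) following the three cases described above."""
    above = [lang for lang, c in confidences.items() if c > preset_value]
    best = max(confidences, key=confidences.get)
    if len(above) == 1:
        # Case 1: exactly one confidence exceeds the preset value.
        return above[0], "single_above_threshold"
    if len(above) >= 2:
        # Case 2: several exceed it; take the maximum confidence.
        return best, "max_of_several_above_threshold"
    # Case 3: none reaches it; the highest confidence may still be used.
    return best, "max_below_threshold"

case1 = decide_language({"de": 0.90, "fr": 0.05, "en": 0.05}, 0.8)
case2 = decide_language({"de": 0.50, "fr": 0.45, "en": 0.05}, 0.4)
case3 = decide_language({"de": 0.40, "fr": 0.35, "en": 0.25}, 0.8)
```

In all three cases the language with the highest confidence is returned; the case label only records how firmly the decision is supported.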

Exemplarily, assume that the 2nd to 5th frames of the voice data to be recognized already allow fast and accurate language identification. If language identification has not timed out when the voice data is decoded into words, i.e., no more than 5 words have been decoded, then as soon as the language classification confidence (i.e., the maximum softmax score over the 21 dimensions) of each word in a preset number of consecutive frames, for example 5 consecutive frames, exceeds the confidence threshold of 0.8 for a certain language, language identification of the voice data to be recognized is finished; otherwise, the language of the latest 5 consecutive frames of the voice data to be recognized continues to be judged. If identification has timed out, i.e., more than 5 words have been decoded, and the confidence of each word in the preset number of consecutive frames, for example five consecutive frames, does not reach the confidence threshold of any language, then the language with the highest confidence score in the latest 5 frames of voice data can be determined as the final language classification result.
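A streaming version of this example might look like the sketch below. The window of 5 frames, the threshold of 0.8 and the limit of 5 words come from the example; the function name, the input format (a word count plus per-language confidences for each frame) and the choice to sum scores over the window for the timeout fallback are assumptions.

```python
from collections import deque

def streaming_language_decision(frames, window=5, threshold=0.8, max_words=5):
    """frames: iterable of (words_decoded_so_far, {language: confidence})."""
    recent = deque(maxlen=window)  # keeps only the latest `window` frames
    for words_so_far, confidences in frames:
        recent.append(confidences)
        if len(recent) < window:
            continue
        # Finish early if one language tops every frame above the threshold.
        top_languages = [max(c, key=c.get) for c in recent]
        same_language = len(set(top_languages)) == 1
        if same_language and all(c[top_languages[0]] > threshold for c in recent):
            return top_languages[0], "confident"
        # Timed out: fall back to the best total score over the latest window.
        if words_so_far > max_words:
            totals = {}
            for c in recent:
                for language, score in c.items():
                    totals[language] = totals.get(language, 0.0) + score
            return max(totals, key=totals.get), "timeout_fallback"
    return None, "undecided"

confident = streaming_language_decision(
    [(i, {"de": 0.9, "fr": 0.1}) for i in range(1, 6)])
fallback = streaming_language_decision(
    [(i, {"de": 0.6, "fr": 0.4}) for i in range(1, 8)])
undecided = streaming_language_decision(
    [(i, {"de": 0.9, "fr": 0.1}) for i in range(1, 4)])
```

The first call stops at the fifth frame with a confident result, the second falls back once more than 5 words have been decoded, and the third has too few frames to decide.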

In the application, the voice data to be recognized can also be decoded, and the decoded voice data can be displayed in real time.

Specifically, before the language corresponding to the voice data to be recognized is determined, the voice data to be recognized is decoded and the decoded voice data is displayed in a preset language. After the language corresponding to the voice data to be recognized is determined, the voice data to be recognized is decoded using the mixed bilingual model corresponding to the determined language, and the displayed content is replaced with, and subsequently continued in, the determined language.

In a specific implementation, regarding the on-screen display before the language is determined: after the first output result of step 202 has been input into the high-level hidden layers of the multiple mixed bilingual models, the resulting outputs, namely the multiple second output results, can be cached, so that no softmax calculation is performed on the outputs of the high-level hidden layers before the language is determined. That is, the calculation of the mixed output layer of each mixed bilingual model is suspended and no softmax calculation takes place before the language is determined. For this reason, in the constructed multilingual acoustic model, a hidden layer of a preset language model can be added after the output of the shared hidden layer. While the merged output of the shared hidden layer, namely the first output result, is input into the high-level hidden layers of the multiple mixed bilingual models, it can also be input into the hidden layer of the preset language model to obtain a third output result. The third output result can then be used to display the voice data to be recognized in the preset language before the language corresponding to the voice data to be recognized is determined.

For example, as shown in fig. 3, the preset language may be a language widely used in a certain region. For recognizing the speech of users in European countries, for instance, English is widely used in Europe, so a hidden layer of a preset English model can be added after the output of the shared hidden layer. The constructed multilingual acoustic model can then display results on screen in English, which also serves as a fallback.

To avoid frequent changes of the on-screen result, the displayed language can be replaced only after the language has been determined. Specifically, as shown in fig. 3, the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer. After the language corresponding to the voice data to be recognized is determined, the mixed bilingual model corresponding to that language can be used to replace the displayed voice information with text in that language. Specifically, the cached output result of the high-level hidden layer of the mixed bilingual model corresponding to the language is passed through the softmax calculation of that model's mixed output layer, and the result replaces the information previously displayed in English. This achieves low time consumption and low delay in on-screen display and improves user experience.

In practical application, the on-screen result can be replaced with the corresponding language according to the softmax calculation of the mixed bilingual model that matches the real-time language classification result. The voice data to be recognized may be mixed-language audio, for example audio in English plus a local language (assume English plus French). After the recognized language has been determined, the corresponding mixed bilingual acoustic model can be activated and used to calculate softmax on the cached output, so as to recognize each word, and the words in the on-screen result are replaced with the words recognized as French.
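The on-screen behaviour described above (provisional display in the preset language, then a single replacement once the language is determined) can be sketched as follows. The class, its method names and the example words are hypothetical and only illustrate the replacement logic, not the decoding itself.

```python
class OnScreenText:
    """Holds the text shown to the user while speech is being recognized."""

    def __init__(self, preset_language="en"):
        self.language = preset_language
        self.words = []
        self.language_determined = False

    def show_provisional(self, word):
        # Output of the preset-language branch; ignored once the language
        # is determined, since that branch is no longer calculated.
        if not self.language_determined:
            self.words.append(word)

    def replace_with(self, language, decoded_words):
        # One replacement after the language decision, using words decoded
        # from the cached outputs of the matching mixed bilingual model.
        self.language = language
        self.words = list(decoded_words)
        self.language_determined = True

    def append(self, word):
        # Continued display in the determined language.
        self.words.append(word)

    def render(self):
        return " ".join(self.words)

screen = OnScreenText()
screen.show_provisional("bone")
screen.show_provisional("sure")
screen.replace_with("fr", ["bonjour"])
screen.show_provisional("ignored")
screen.append("Paris")
```

Replacing the text once, rather than on every frame, is what keeps the on-screen result from changing frequently.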

Referring to fig. 4, an application diagram of the multilingual acoustic model provided in the present application is shown. The constructed multilingual acoustic model can be applied to scenarios that recognize the personalized speech of a user, and the working mechanism of the acoustic and language models corresponding to the multilingual acoustic model can be divided into two stages, namely decoding and on-screen display.

Specifically, before the language is determined using the constructed multilingual acoustic model, the streaming on-screen result can be decoded by models for a language widely used in the region, for example an English acoustic model working together with an English language model, which safeguards user experience; meanwhile, the hidden layer outputs can be cached by the multilingual acoustic model for use in language judgment. After the language is determined, the English branch shown in fig. 3 is no longer calculated. Softmax calculation is then performed by the mixed bilingual model corresponding to the language result, that is, by the mixed output layer formed from the cached high-level hidden layer output of that mixed bilingual model. The softmax result is used to replace the on-screen result, and at the same time the language model corresponding to the language is called for normal decoding.

Personalized voice recognition for a user may work as follows: while the voice data to be recognized is being decoded, resource information of the city in which the user's IP (Internet Protocol) address is located can be called, together with the language information determined by the multilingual acoustic model, so as to improve the recognition rate of the voice data to be recognized. As shown in fig. 4, the resource may be an additional N-gram model (a statistical language model) trained on place names. The general NNLM (Neural Network Language Model) may be trained on texts related to the POI (Point of Interest) place names of the whole country corresponding to the identified language. Compared with the general NNLM, the personalized city-level model is trained mainly on the (smaller amount of) POI place-name text of the corresponding city. In view of calculation amount and storage, the personalized city-level model is kept small, which completes the construction of the personalized language model.

In this application scenario, the constructed multilingual acoustic model determines the language of the user's voice data to be recognized, and the language model is established based on that language and on the user's resource information. The user's resource information is thus used comprehensively in recognizing the user's language, which improves the accuracy of language identification.

It should be noted that, for simplicity of description, the method is described as a series of acts or combinations of acts, but those skilled in the art will appreciate that the present application is not limited by the described order of acts, since according to the present application some steps may be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the examples described in this specification are preferred examples and that the acts involved are not necessarily all required by the present application.

Referring to fig. 5, a block diagram of a multi-language speech recognition apparatus provided in the present application is shown, which may specifically include the following modules:

a multilingual acoustic model obtaining module 501, configured to obtain voice data to be recognized and a multilingual acoustic model; the multilingual acoustic model is obtained based on fusion of shared hidden layers of a plurality of mixed bilingual models;

a confidence generating module 502, configured to obtain confidence levels for various languages according to the to-be-recognized speech data and the multilingual acoustic model;

a language identification module 503, configured to determine, based on the confidence degrees for the languages, a language corresponding to the voice data to be identified.

The multilingual speech recognition apparatus may further include an on-screen display module, configured to decode the voice data to be recognized and display the decoded voice data in real time. The on-screen display module is specifically configured to decode the voice data to be recognized and display the decoded voice data in a preset language before the language corresponding to the voice data to be recognized is determined; and, after the language corresponding to the voice data to be recognized is determined, to decode the voice data to be recognized using the mixed bilingual model corresponding to the determined language and to replace and continue the display of the decoded voice data in the determined language.

In the multilingual speech recognition apparatus, the hidden layers of the multiple mixed bilingual models comprise bottom hidden layers and high-level hidden layers distinguished according to a preset proportion, and the bottom hidden layers are merged to generate the shared hidden layer. The confidence generating module 502 is specifically configured to input the voice data to be recognized into the shared hidden layer of the multiple mixed bilingual models to obtain a first output result; to input the first output result into each of the high-level hidden layers of the multiple mixed bilingual models to obtain multiple second output results; and to merge the multiple second output results as the input of a preset language classification model to obtain multiple confidences for each language.

The confidence generating module 502 may be specifically configured to splice the multidimensional feature vectors representing the second output results according to their corresponding dimensions, and to use the spliced feature vectors as the input of the preset language classification model to obtain multiple confidences for different languages. The language identification module 503 may be specifically configured to: when exactly one confidence is greater than a preset value, determine the language corresponding to that confidence as the language corresponding to the voice data to be recognized; or, when two or more confidences are greater than the preset value, determine the language corresponding to the maximum confidence as the language corresponding to the voice data to be recognized; or, when none of the confidences reaches the preset value, take the language corresponding to the maximum confidence as the language of the voice data to be recognized.

In the multilingual speech recognition apparatus, the multilingual acoustic model is built based on a neural network, and the apparatus may further include a multilingual acoustic model generating module, configured to generate the multilingual acoustic model based on fusion of the shared hidden layers of the multiple mixed bilingual models.

The multilingual acoustic model generating module is specifically configured to: divide the hidden layers of the multiple mixed bilingual models into bottom hidden layers and high-level hidden layers according to a preset proportion, and merge the bottom hidden layers to generate the shared hidden layer, wherein the mixed bilingual models are neural networks comprising multiple hidden layers, and the high-level hidden layer of each mixed bilingual model has the language features corresponding to that mixed bilingual model; add a hidden layer of a preset language model after the output of the shared hidden layer; merge the outputs of the high-level hidden layers, which carry the language features corresponding to the mixed bilingual models, as the input layer of the preset language classification model so as to construct the preset language classification model; and independently construct a mixed output layer from the high-level hidden layer of each mixed bilingual model, so that the multilingual acoustic model is generated from the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layers, the preset language classification model and the mixed output layers.
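The division by a preset proportion can be illustrated with a small sketch. The layer labels, the ratio of 0.5 and the function names are hypothetical, and the sketch only collects the parts that would be merged or reused; how the bottom hidden layers are actually merged into a shared hidden layer is not shown here.

```python
def split_hidden_layers(layers, bottom_ratio):
    """Split an ordered list of hidden layers into (bottom, high-level) parts."""
    cut = int(len(layers) * bottom_ratio)
    return layers[:cut], layers[cut:]

def collect_components(bilingual_models, bottom_ratio):
    """Gather per-model bottom layers (to be merged) and high-level layers."""
    bottoms, highs = {}, {}
    for name, layers in bilingual_models.items():
        bottoms[name], highs[name] = split_hidden_layers(layers, bottom_ratio)
    return {"bottom_layers_to_merge": bottoms, "high_level_layers": highs}

# Hypothetical four-layer mixed bilingual models, labelled by layer name only.
bilingual_models = {
    "en-de": ["de_l1", "de_l2", "de_l3", "de_l4"],
    "en-fr": ["fr_l1", "fr_l2", "fr_l3", "fr_l4"],
}
components = collect_components(bilingual_models, bottom_ratio=0.5)
```

With a ratio of 0.5, the first two layers of each model go to the shared part and the last two remain as that model's high-level hidden layers.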

Since the device is substantially similar to the method, its description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method.

The application also provides a vehicle-mounted terminal, including:

the above multilingual speech recognition device, a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements each process of the above multilingual speech recognition method and can achieve the same technical effect; to avoid repetition, the details are not repeated here.

The application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements each process of the above multilingual speech recognition method and can achieve the same technical effect; to avoid repetition, the details are not repeated here.

The various examples in this specification are described in a progressive manner, each example focuses on differences from other examples, and the same and similar parts among the various examples can be referred to each other.

As will be appreciated by one of skill in the art, the examples of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present examples may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present examples may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The present examples are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to examples of the present application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred examples of the present application have been described, additional variations and modifications to these examples may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred examples and all changes and modifications that fall within the exemplary scope of the present application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The method, device, terminal and storage medium for multilingual speech recognition provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the above description is only intended to help understand the method and core idea of the present application. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation to the present application.

Claims

1. A method for multilingual speech recognition, the method comprising:

acquiring voice data to be recognized and a multilingual acoustic model; the multilingual acoustic model is obtained based on fusion of shared hidden layers of a plurality of mixed bilingual models; wherein the shared hidden layer is generated based on bottom hidden layers of the plurality of mixed bilingual models;

obtaining confidences for each language according to the voice data to be recognized and the multilingual acoustic model; wherein the confidences are obtained by merging the output results of the high-level hidden layers of the plurality of mixed bilingual models and inputting the merged result into a preset language classification model, and the output results of the high-level hidden layers of the plurality of mixed bilingual models are obtained by inputting the output result of the shared hidden layer of the plurality of mixed bilingual models into the high-level hidden layers;

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein the hidden layers of the plurality of mixed bilingual models comprise bottom hidden layers and high-level hidden layers divided according to a preset ratio, the bottom hidden layers being merged to generate the shared hidden layer; the obtaining confidence degrees for each language according to the voice data to be recognized and the multilingual acoustic model comprises:

4. The method according to claim 3, wherein said merging the second output results as an input of a predetermined language classification model to obtain confidence levels for each language comprises:

5. The method according to any one of claims 1 to 4, wherein the determining the language corresponding to the voice data to be recognized based on the confidence degrees for the languages includes:

6. The method of claim 3, wherein the displaying the decoded speech data in real-time comprises:

7. The method of claim 6, wherein the multilingual acoustic models include a predetermined language model, and the parsing the speech data to be recognized and displaying the parsed speech data in the predetermined language before determining the language corresponding to the speech data to be recognized comprises:

8. The method according to claim 6, wherein the high-level hidden layers of the mixed bilingual models respectively and independently constitute a mixed output layer, and after determining the language type corresponding to the voice data to be recognized, parsing the voice data to be recognized by using the mixed bilingual model corresponding to the determined language type, and continuing to display the parsed voice data in the determined language type, the method comprises:

9. The method of claim 1, wherein the multilingual acoustic model is created based on a neural network, further comprising:

10. An apparatus for multi-lingual speech recognition, the apparatus comprising:

a multilingual acoustic model acquisition module, configured to acquire voice data to be recognized and a multilingual acoustic model, wherein the multilingual acoustic model is obtained by fusing shared hidden layers of a plurality of mixed bilingual models, and the shared hidden layer is generated based on the bottom hidden layers of the plurality of mixed bilingual models;

a confidence generation module, configured to obtain confidence degrees for each language according to the voice data to be recognized and the multilingual acoustic model, wherein the confidence degrees are obtained by merging the output results of the high-level hidden layers of the plurality of mixed bilingual models and inputting the merged result into a preset language classification model, and the output results of the high-level hidden layers of the plurality of mixed bilingual models are obtained by inputting the output results of the shared hidden layers of the plurality of mixed bilingual models into the high-level hidden layers;

11. A vehicle-mounted terminal, comprising: the apparatus for multi-language recognition of speech according to claim 10, a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, carries out the steps of the method for multi-language recognition of speech according to any one of claims 1 to 9.

12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method for multi-language recognition of speech according to any one of claims 1 to 9.