WO2023138286A1 - Multi-language recognition method and apparatus for speech, and terminal and storage medium - Google Patents

Multi-language recognition method and apparatus for speech, and terminal and storage medium Download PDF

Info

Publication number
WO2023138286A1
WO2023138286A1 · PCT/CN2022/140282 · CN2022140282W
Authority
WO
WIPO (PCT)
Prior art keywords
language
recognized
model
mixed
bilingual
Prior art date
Application number
PCT/CN2022/140282
Other languages
French (fr)
Chinese (zh)
Inventor
张辽 (Zhang Liao)
Original Assignee
广州小鹏汽车科技有限公司 (Guangzhou Xiaopeng Motors Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州小鹏汽车科技有限公司 (Guangzhou Xiaopeng Motors Technology Co., Ltd.)
Publication of WO2023138286A1 publication Critical patent/WO2023138286A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Definitions

  • the present application relates to the field of computer technology, and in particular to a multilingual voice recognition method, a corresponding multilingual voice recognition device, a corresponding vehicle-mounted terminal and a computer-readable storage medium.
  • the speech data to be recognized may not only be speech in a single language, but may also be bilingual mixed speech or multilingual mixed speech.
  • the first aspect of the present application provides a multilingual voice recognition method, including:
  • the multilingual acoustic model is obtained based on the fusion of shared hidden layers of multiple mixed bilingual models
  • the language corresponding to the voice data to be recognized is determined based on the confidence for each language.
  • the second aspect of the present application provides a multilingual recognition device, including:
  • the multilingual acoustic model acquisition module is used to obtain the speech data to be recognized and the multilingual acoustic model; the multilingual acoustic model is obtained based on the shared hidden layer fusion of multiple mixed bilingual models;
  • a confidence level generation module configured to obtain confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model
  • the language recognition module is configured to determine the language corresponding to the speech data to be recognized based on the confidence for each language.
  • the third aspect of the present application provides a vehicle-mounted terminal, including: a processor, a memory, and a computer program stored in the memory and executable on the processor.
  • when the computer program is executed by the processor, the steps of any one of the above multilingual speech recognition methods are implemented.
  • the fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of any one of the multilingual acoustic model-based language recognition methods or of the multilingual speech recognition methods above are implemented.
  • the speech data to be recognized is recognized by using the multilingual acoustic model generated by the fusion of shared hidden layers based on multiple mixed bilingual models to obtain the confidence for each language, and the language corresponding to the speech data to be recognized is determined based on the obtained confidence to complete the multilingual recognition of the speech.
  • the multilingual acoustic model based on the fusion of the shared hidden layers of multiple mixed bilingual models is used to recognize the multilingual speech. Based on the shared hidden layer in the model, the amount of calculation in the traditional multilingual recognition model is reduced, the efficiency of language recognition is improved, and the user experience is improved.
  • Fig. 1 is a model schematic diagram of a multilingual acoustic model in the related art
  • Fig. 2 is the flow chart of the steps of the multilingual recognition method of speech provided by the present application
  • Fig. 3 is a model schematic diagram of the multilingual acoustic model provided by the present application.
  • Fig. 4 is a schematic diagram of the application of the multilingual acoustic model provided by the present application.
  • Fig. 5 is a structural block diagram of the multilingual speech recognition device provided by the present application.
  • first, second, third and so on may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another.
  • first information may also be called second information, and similarly, second information may also be called first information.
  • a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features.
  • “plurality” means two or more, unless otherwise specifically defined.
  • the present application provides a multilingual voice recognition method, a corresponding multilingual voice recognition device, a corresponding vehicle-mounted terminal, and a computer-readable storage medium that overcome the problems mentioned in the background technology or at least partially solve the problems mentioned in the background technology.
  • the speech data to be recognized may not only be speech in a single language, but may also be bilingual mixed speech or multilingual mixed speech.
  • in regions around the world, such as Asia and Europe, the system needs to recognize POI (Point of Interest) names as well as command words for the other major countries and regions. From the perspective of modeling cost and user experience, the multi-language recognition system to be established needs to occupy few resources and recognize speech quickly.
  • FIG. 1 a schematic diagram of a multilingual acoustic model in the related art is shown.
  • the preset language widely used in this area is English.
  • English is a widely used language, so mixed bilingual models are built between English and each other language. For example, in the English-German model, its neural network layers, such as an N-layer LSTM (Long Short-Term Memory) hidden stack, produce English and German phoneme feature vectors whose softmax scores are computed by its mixed output layer; the English-French model likewise uses an M-layer LSTM hidden stack and its own mixed output layer. More than 20 such sets of mixed bilingual models are built, mainly to model the place names, person names, and institution names of each region with the corresponding language system.
  • by modeling each bilingual pair in this way, such as English-German and English-French, multiple mixed bilingual models are obtained.
  • the language corresponding to the speech data to be recognized can be determined.
  • this language recognition method based on the output scores of multiple sets of acoustic models involves modeling more than 20 sets of mixed bilingual models such as English-German and English-French, so it either demands a CPU (Central Processing Unit) with stronger performance or a reduction in acoustic model size. Reducing the model size means reducing the feature-vector dimension and the number of neural network layers, which weakens each layer's feature selection and degrades recognition. In addition, the score comparison ("score PK") across multiple groups of language scores causes frequent changes in the on-screen results and increases display latency, so the approach cannot meet the requirements of low resource occupation and fast recognition, harming the user's on-screen experience.
  • FIG. 2 shows a flow chart of the steps of the multilingual recognition method of speech provided by the present application, which may specifically include the following steps:
  • Step 201: acquire speech data to be recognized and a multilingual acoustic model, wherein the multilingual acoustic model is obtained by fusing the shared hidden layers of multiple mixed bilingual models;
  • multilingual speech recognition can be performed by using a multilingual acoustic model based on the fusion of shared hidden layers of multiple mixed bilingual models. Based on the shared hidden layer in the model, the amount of calculation in the traditional multilingual recognition model can be reduced, and the efficiency of language recognition can be improved.
  • the multilingual acoustic model used in this application can be constructed before obtaining the multilingual acoustic model based on the fusion of the shared hidden layers of multiple mixed bilingual models.
  • one of the core ideas of the multilingual acoustic model constructed in the present application is to merge the bottom hidden layers of multiple groups of mixed bilingual models into a shared hidden layer, reducing the memory consumption of the model. In addition, in the process of recognizing the speech data, a language classification step based on a preset language classification model is added, and the output of each high-level hidden layer is cached until the language is determined, so that the speech can subsequently be displayed on screen with the mixed bilingual model of the determined language, reducing the amount of calculation of the multilingual acoustic model.
  • the multilingual acoustic model can be generated by using the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layer, the preset language classification model and the constructed mixed output layer.
  • each mixed bilingual model has a neural network with multiple hidden layers.
  • some hidden layers share parameters across languages in the feature dimension, for example for pauses and syllable lengths that are common among different languages; the multi-layer hidden stack of a mixed bilingual model also includes hidden layers carrying its own language characteristics.
  • the hidden layer with parameter commonality can be called the bottom hidden layer
  • the hidden layer with obvious language characteristics can be called the high-level hidden layer.
  • each group of mixed bilinguals can be constructed based on the preset languages widely used in this region and other languages, and is not limited to English-German, English-French and other acoustic models.
  • FIG. 3 it shows a schematic diagram of the multilingual acoustic model provided by the present application.
  • for multiple mixed bilingual models, such as English-German and English-French, the hidden layers they contain can be divided according to whether they carry obvious language characteristics.
  • for example, in each mixed bilingual model about 80% of the hidden layers are bottom hidden layers and 20% are high-level hidden layers.
  • the bottom hidden layers are merged into a shared hidden layer, while the high-level hidden layers that clearly carry the characteristics of a specific language family are retained; introducing the shared bottom layers improves how well the multilingual acoustic model fits the hardware of the device it runs on.
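The split-and-merge step above can be sketched in a few lines. The 80/20 ratio comes from the text; merging by parameter averaging is an illustrative assumption, and each "layer" is reduced to a single scalar weight:

```python
# Split each bilingual model's hidden-layer stack into bottom (shared) and
# high-level (language-specific) parts, then merge the bottom parts into
# one shared stack by averaging. Toy scalars stand in for layer parameters.

def partition_layers(layers, bottom_ratio=0.8):
    """Split a hidden-layer stack into bottom and high-level parts."""
    cut = int(len(layers) * bottom_ratio)
    return layers[:cut], layers[cut:]

def merge_bottom_layers(models):
    """Average the bottom layers of several bilingual models into a
    shared hidden-layer stack."""
    bottoms = [partition_layers(m)[0] for m in models]
    return [sum(b[i] for b in bottoms) / len(bottoms)
            for i in range(len(bottoms[0]))]

# toy 10-layer models: 8 bottom layers are merged, 2 high-level layers kept
en_de = [float(i) for i in range(10)]
en_fr = [float(i) + 1.0 for i in range(10)]

shared = merge_bottom_layers([en_de, en_fr])
print(len(shared))   # → 8
print(shared[0])     # → 0.5
```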
  • the output of the merged shared hidden layer serves as the input of each retained high-level hidden layer, so that once the language is determined the voice data can be displayed through the mixed output layer of that language. While memory consumption and calculation load are reduced, retaining the high-level hidden layers preserves the modeling accuracy of the multilingual acoustic model.
  • a hidden layer of a preset language model can also be attached to the output of the shared hidden layer, so that the output of the shared hidden layer also feeds the preset language model. The voice data can then be displayed on screen in the preset language before the language is determined, which reduces display latency and avoids frequent language changes in the on-screen results, improving the user's on-screen experience.
  • the preset language can refer to a language that is widely used in a certain region. For example, for recognizing user speech in European countries, English is widely used in Europe; a hidden layer of a preset English model can therefore be attached to the output of the shared hidden layer, allowing the multilingual acoustic model to display results on screen in English as a fallback.
  • the method of determining the speech language can be realized by introducing a preset language classification model.
  • the high-level hidden layers of multiple mixed bilingual models have language characteristics corresponding to each mixed bilingual model.
  • multiple output layers of the high-level hidden layers with language characteristics corresponding to each mixed bilingual model can be combined as the input layer of the preset language classification model to build a preset language classification model.
  • the language features used to train the language classification model mainly have high-level abstract features.
  • this high-level abstract feature is based on the output of the bottom hidden layer and high-level hidden layer of multiple mixed bilingual models. Since the neural network in the model can directly use language features that already have high-level abstract features, there is no need to prepend a large hidden layer for feature extraction.
  • the small number of neural network layers keeps the latency and computational load of the language classification model low, while splicing the cached high-level abstract features keeps the model's recognition accuracy high.
  • the independently formed mixed output layer can be used to decode the language in the speech data to be recognized and display the corresponding language on the screen.
  • the output results of the high-level hidden layers of multiple mixed bilingual models can be cached.
  • the mixed output layers of the bilingual models are handled as follows: the mixed output layer formed independently from the high-level hidden layers of each mixed bilingual model performs the softmax calculation only after the corresponding language has been determined by the preset language classification model.
  • the output results of the high-level hidden layers of multiple mixed bilingual models are cached to ensure that the softmax calculation is not performed on the output results of the high-level hidden layers before the language is determined.
  • softmax is a standard machine-learning function: it normalizes a set of values into proportions, so that the similarity of each candidate word to the voice data can be determined from the calculated proportions and words can be selected for display on screen.
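The proportion calculation just described can be sketched directly; the candidate scores below are made-up toy values:

```python
import math

def softmax(scores):
    """Normalize raw output-layer scores into proportions summing to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# toy scores for three candidate words; the resulting proportions decide
# which word is shown on screen
probs = softmax([2.0, 1.0, 0.1])
print(probs[0] > probs[1] > probs[2])  # → True
```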
  • in the figure, "hidden layer output" refers to the output result of the high-level hidden layer, and "the last layer softmax" refers to the constructed mixed output layer.
  • otherwise, the mixed output layer of every mixed bilingual model would need to perform the softmax calculation; after the language is determined, only the mixed output layer of the mixed bilingual model corresponding to that language performs the softmax calculation on the speech data to be recognized.
  • multiple mixed bilingual models are neural networks including multiple layers of hidden layers, which adopt the LSTM structure.
  • the multilingual acoustic model constructed based on the shared hidden layer and the high-level hidden layer in each mixed bilingual model also adopts the LSTM structure.
  • the dimension of the hidden layer in the model is 512
  • the multilingual acoustic model constructed is suitable for cloud storage.
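The overall structure described above, a shared bottom stack feeding language-specific high-level heads plus a preset English head for fallback display, with a classifier over the spliced high-level outputs, can be sketched with toy layers. All layer functions, biases, and shapes here are illustrative assumptions, not the patent's actual parameters:

```python
# Minimal structural sketch of the multilingual acoustic model: shared
# bottom stack -> cached high-level heads + preset English head -> small
# language classifier over the spliced high-level outputs.

def shared_hidden(frame):
    # bottom layers: language-independent features (pauses, syllable length)
    return [x * 0.5 for x in frame]

def make_high_head(bias):
    # high-level layers carrying one bilingual pair's language character
    def head(shared_out):
        return [x + bias for x in shared_out]
    return head

heads = {"en-de": make_high_head(0.1), "en-fr": make_high_head(0.2)}
preset_en_head = make_high_head(0.0)  # preset-language (English) display path

def language_classifier(spliced):
    # stand-in for the small conformer classifier: one score per pair
    n = len(spliced) // len(heads)
    return {name: sum(spliced[i * n:(i + 1) * n])
            for i, name in enumerate(heads)}

frame = [1.0, 2.0]                 # one toy acoustic frame
shared_out = shared_hidden(frame)  # first output result

# cache each high-level (second) output; softmax on them is deferred
# until the language has been determined
cache = {name: head(shared_out) for name, head in heads.items()}
english_out = preset_en_head(shared_out)  # drives the fallback display

spliced = [v for name in heads for v in cache[name]]  # feature splicing
scores = language_classifier(spliced)
print(max(scores, key=scores.get))  # → en-fr
```

The deferred-softmax cache and the English fallback path correspond to the two mechanisms the text attributes to the reduced calculation load and the low display latency.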
  • Step 202: obtain the confidence level for each language according to the voice data to be recognized and the multilingual acoustic model;
  • the multilingual acoustic model can be used to identify the speech data to be recognized, and obtain the confidence for each language, so that the language corresponding to the speech data to be recognized can be determined based on the obtained confidence.
  • the hidden layers of multiple mixed bilingual models include a bottom hidden layer and a high-level hidden layer according to a preset ratio, and the bottom hidden layer is used to merge to generate a shared hidden layer for building a multilingual acoustic model.
  • the voice data to be recognized can be passed through the shared hidden layer of the multiple mixed bilingual models to obtain a first output result without obvious language characteristics; the first output result is then fed to the multiple high-level hidden layers of the mixed bilingual models to obtain second output results with obvious language characteristics; and the multiple second output results are used as the input of the preset language classification model to obtain confidence levels for each language.
  • the shared hidden layer may refer to hidden layers with common parameters in different mixed bilingual models, such as hidden layers for features such as pauses and syllable lengths that are common among different languages.
  • the high-level hidden layer can refer to the hidden layer with obvious language characteristics in multiple mixed bilingual models.
  • they can respectively carry the language characteristics for each specific language family.
  • This output result can be used for language recognition.
  • the output results of the high-level hidden layers of the multiple mixed bilingual models, that is, the multiple second output results, can be cached so that no softmax calculation is performed on them before the language is determined; in other words, the mixed output layer calculation of each mixed bilingual model is suspended at this point, reducing the calculation amount of the model.
  • the cached multiple second output results need to be softmax calculated when the recognition language is determined.
  • the introduced preset language model, such as the hidden layer of the English model, ensures that skipping the softmax calculation of the bilingual models does not prevent real-time on-screen display, which would otherwise affect the user experience.
  • in order to determine the language of the voice data to be recognized, a preset language classification model can be introduced.
  • the output results of the high-level hidden layer carry obvious language characteristics for the specific language family.
  • multiple output layers of the high-level hidden layer of multiple mixed bilingual models can be used as input layers for training the preset language classification model to construct a preset language classification model, so that when determining the language of the voice data, the corresponding language can be determined through the confidence of the preset language classification model for each language classification.
  • the high-level hidden layers of each mixed bilingual model carry language features corresponding to that model, so their output results have an obvious language character.
  • the multi-dimensional feature vector used to represent the second output result can be spliced according to the corresponding dimensions.
  • the spliced features are fed to an M-layer convolutional conformer, which computes the language softmax scores to obtain the confidence for each language.
  • the spliced language features are high-level abstract features.
  • multiple confidence levels for different languages can be output in a short period of time without the need to recognize the complete voice request audio.
  • the output confidence levels can be used to judge real-time languages, which can ensure the real-time performance of the speech recognition system when making hybrid model decisions and the accuracy of language classification.
  • the language features used to train the language classification model are mainly high-level abstract features. As shown in Fig. 3, these features are based on the outputs of the bottom and high-level hidden layers of the multiple mixed bilingual models; with fewer neural network layers the language classification model achieves low latency and low calculation, while splicing the high-level abstract features keeps the recognition accuracy high.
  • Step 203: determine the language corresponding to the voice data to be recognized based on the confidence level for each language.
  • the speech data to be recognized can be decoded in real time, and the words of continuous preset length frames obtained by real-time decoding can be input into the language classification model to determine the language result of the real-time speech segment.
  • the words of continuous preset length frames obtained by real-time decoding can be input into the language classification model, and the confidence of the words of the continuous preset length frames for each language is obtained through the language softmax calculation of the language classification model, and the language corresponding to the speech data to be recognized is determined based on the confidence.
  • the confidence level for each language can be used to represent the recognition possibility of the voice data to be recognized and each language, then when determining the language of the voice data, the language corresponding to the voice data to be recognized can be determined based on the multiple confidence levels of each language through the preset language classification model. Specifically, the real-time language result for the input word may be determined based on the judgment result of the confidence level and the preset value.
  • the preset value may be a confidence threshold for each language, which is not limited in this embodiment of the present invention.
  • language recognition can typically be completed on the 2nd to 5th frames of the speech data to be recognized, achieving fast and accurate recognition with low latency.
  • if there is no timeout, that is, no more than 5 words have been decoded, and the language classification confidence of each word in a run of consecutive preset-length frames, for example 5 consecutive frames, for a certain language (i.e. the maximum softmax score among the 21 dimensions) exceeds the confidence threshold of 0.8, the language recognition of the speech data to be recognized is finished; otherwise the judgment continues on the most recent 5 consecutive frames of speech data. If the decoded words already exceed 5 and the confidence of each word in 5 consecutive frames has not reached the confidence threshold of any language, the language with the highest confidence score over the last 5 frames of speech data is taken as the final language classification result.
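The stopping rule just described, a confidence threshold of 0.8 over 5 consecutive frames with a highest-score fallback on timeout, can be sketched as follows. The sliding-window mechanics and the per-word confidence format are assumptions made for illustration:

```python
def decide_language(frame_confidences, threshold=0.8, window=5):
    """Decide the language from per-word confidence dicts.

    Returns (language, committed): committed is True when some run of
    `window` consecutive words all exceed `threshold` for one language,
    False when the timeout fallback (best score over the last `window`
    words) had to be used.
    """
    langs = list(frame_confidences[0])
    # look for a window of consecutive words all confident in one language
    for start in range(len(frame_confidences) - window + 1):
        win = frame_confidences[start:start + window]
        for lang in langs:
            if all(f[lang] > threshold for f in win):
                return lang, True
    # timeout: take the highest-confidence language of the last words
    last = frame_confidences[-window:]
    best = max(langs, key=lambda l: max(f[l] for f in last))
    return best, False

print(decide_language([{"de": 0.9, "fr": 0.1}] * 5))  # → ('de', True)
print(decide_language([{"de": 0.5, "fr": 0.6}] * 6))  # → ('fr', False)
```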
  • the voice data to be recognized can also be decoded, and the decoded voice data can be displayed in real time.
  • the output results can be cached to ensure that the softmax calculation is not performed on the outputs of the high-level hidden layers before the language is determined; that is, the mixed output layer calculation of each mixed bilingual model is suspended at this point.
  • a hidden layer of the preset language model can be attached to the output of the shared hidden layer: when the output of the merged shared hidden layer, i.e. the first output result, is fed into the high-level hidden layers of the multiple mixed bilingual models, it can also be fed into the hidden layer of the preset language model to obtain a third output result. Before the language corresponding to the speech data is determined, the third output result is used to display the speech data in the preset language.
  • the preset language can refer to a language that is widely used in a certain region.
  • English is a language widely used in the European region.
  • the hidden layer of the preset English model can be added to the output of the shared hidden layer, so that the multilingual acoustic model can display results on screen in English as a fallback.
  • the language replacement operation can be performed on the results on the screen after the language is determined.
  • the high-level hidden layers of a plurality of mixed bilingual models independently form a mixed output layer.
  • a mixed bilingual model corresponding to the language of the speech data to be recognized can be used to replace and display the displayed speech information with the language corresponding to the speech data to be recognized.
  • the cached output result of the high-level hidden layer of the mixed bilingual model corresponding to the determined language can be used: the softmax calculation is performed on it, and its output replaces the speech information previously displayed on screen in English, achieving low display latency and improving the user experience.
  • the results displayed on the upper screen can be replaced by corresponding languages according to the softmax calculation of the mixed bilingual model that matches the real-time language classification results.
  • the voice data to be recognized may be mixed language audio, such as English + local language audio (assuming English + French).
  • after the language is determined, the corresponding bilingual mixed acoustic model can be activated; this model performs the softmax calculation on the previously cached high-level hidden layer outputs to recognize each word, and each recognized French word replaces the corresponding word in the on-screen result.
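The two-stage display flow, English fallback first and in-place replacement once the language is known, can be sketched as below. `decode` is a toy stand-in for the softmax-based word selection of the matched mixed output layer, and all words are hypothetical examples:

```python
# Stage 1 streams the preset-language (English) decoding immediately;
# stage 2 decodes the cached high-level outputs with the matched model's
# output layer and replaces the words shown on screen.

screen = []

def stream_fallback(words_en):
    """stage 1: show the preset-language decoding right away"""
    screen.extend(words_en)

def replace_with_language(cached_outputs, decode):
    """stage 2: replace the screen using the matched model's decoder"""
    screen[:] = [decode(o) for o in cached_outputs]

stream_fallback(["navigate", "to", "paris"])       # English fallback shown
cached = ["naviguer", "vers", "paris"]             # cached per-word outputs (toy)
replace_with_language(cached, decode=lambda w: w)  # French replaces English
print(screen)  # → ['naviguer', 'vers', 'paris']
```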
  • the speech data to be recognized is recognized to obtain the confidence degree for each language, and the language corresponding to the speech data to be recognized is determined based on the obtained confidence degree, and the multilingual recognition of the speech is completed.
  • the multilingual acoustic model based on the fusion of the shared hidden layer of multiple mixed bilingual models is used to recognize the multilingual speech. Based on the shared hidden layer in the model, the amount of calculation in the traditional multilingual recognition model is reduced, the efficiency of language recognition is improved, and the user experience is improved.
  • FIG. 4 it shows a schematic diagram of the application of the multilingual acoustic model provided by the present application.
  • the constructed multilingual acoustic model can be applied to the scene of recognizing the user's personalized voice, and the corresponding acoustic language working mechanism can be divided into two stages of decoding and displaying on the screen.
  • in the decoding stage, the streamed on-screen results are decoded to ensure the user experience, while the hidden layer outputs are cached by the multilingual acoustic model for language determination;
  • in the display stage, the mixed output layer composed of the cached high-level outputs performs the softmax calculation, the result of which replaces the on-screen content, and the language model of the corresponding language is called for normal decoding.
  • during decoding of the speech data to be recognized, based on the user's IP (Internet Protocol) address, the resource information of the city corresponding to that IP address and the language information determined by the multilingual acoustic model can be called upon to improve the recognition rate of the speech data.
  • the resource can refer to an additional N-gram model (a statistical language model) trained on place names.
  • the general neural network NNLM “Nerual Network Language Model) can be obtained based on the text training related to the place name of the POI (Point of Interest) place name of the entire country corresponding to the recognized language.
  • the personalized city-level model compared with the general neural network NNLM, is mainly based on the POI data of the place name text training of the corresponding city ( A small amount), and in consideration of the amount of calculation and storage, the size of the personalized city-level model is small, and the construction of the personalized language model is completed.
  • based on the multilingual acoustic model, a language model can be selected according to the language of the user's speech data to be recognized and the user's resource information, so that the user's resource information is comprehensively used to recognize the user's language and improve the accuracy of language recognition.
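The resource-selection step above can be sketched as follows. This is a minimal illustration only: the model names, registry dictionaries, and lookup keys are hypothetical, since the application does not specify an implementation.

```python
# Hypothetical sketch: pick language-model resources for decoding based on
# the detected language and the city derived from the user's IP address.
# All model/registry names below are illustrative, not from the patent.

GENERAL_NNLM = {"de": "nnlm_de_nationwide", "fr": "nnlm_fr_nationwide"}
CITY_NGRAM = {("de", "Berlin"): "ngram_de_berlin",
              ("fr", "Paris"): "ngram_fr_paris"}

def select_lm_resources(language: str, city: str):
    """Return (general_model, city_model_or_None) used during decoding."""
    general = GENERAL_NNLM.get(language)
    if general is None:
        raise ValueError(f"unsupported language: {language}")
    # The small city-level N-gram is an optional extra resource.
    return general, CITY_NGRAM.get((language, city))

models = select_lm_resources("de", "Berlin")
print(models)  # ('nnlm_de_nationwide', 'ngram_de_berlin')
```

A city without a dedicated model simply falls back to the general nationwide NNLM alone.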
  • FIG. 5 shows a structural block diagram of the multilingual speech recognition apparatus provided by the present application, which may specifically include the following modules:
  • the multilingual acoustic model acquisition module 501 is used to acquire the speech data to be recognized and a multilingual acoustic model; the multilingual acoustic model is obtained by fusing the shared hidden layers of multiple mixed bilingual models;
  • the confidence generation module 502 is used to obtain a confidence level for each language according to the speech data to be recognized and the multilingual acoustic model;
  • the language recognition module 503 is configured to determine the language corresponding to the speech data to be recognized based on the confidence levels for the languages.
  • the device may also include the following modules:
  • the on-screen display module is used to decode the speech data to be recognized and display the decoded speech data in real time.
  • the hidden layers of the multiple mixed bilingual models are divided into bottom hidden layers and high-level hidden layers according to a preset ratio, and the bottom hidden layers are merged to generate a shared hidden layer;
  • the confidence generation module 502 may include the following submodules:
  • the first output result generation sub-module is used to input the speech data to be recognized into the shared hidden layer of the multiple mixed bilingual models to obtain a first output result;
  • the second output result generation sub-module is used to input the first output result into the high-level hidden layers of the multiple mixed bilingual models respectively to obtain multiple second output results;
  • the confidence generation sub-module is used to combine the multiple second output results as input items of the preset language classification model to obtain a confidence level for each language.
  • the confidence generation sub-module is specifically used to splice the multi-dimensional feature vectors characterizing the second output results along the corresponding dimensions, and use the spliced feature vectors as input items of the preset language classification model to obtain the confidence levels for the different languages.
  • the language recognition module 503 is specifically used to: when only one confidence value is greater than a preset value, determine that the language corresponding to that confidence value is the language corresponding to the speech data to be recognized; or, when two or more confidence values are greater than the preset value, determine that the language corresponding to the largest confidence value is the language corresponding to the speech data to be recognized.
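The decision rule of the language recognition module 503 can be sketched as follows (a minimal illustration; the threshold value and language labels are hypothetical):

```python
def pick_language(confidences: dict, threshold: float = 0.5):
    """Return the language for the speech data, or None if no
    confidence exceeds the preset value.

    - exactly one confidence > threshold: that language wins;
    - two or more > threshold: the largest confidence wins.
    """
    above = {lang: c for lang, c in confidences.items() if c > threshold}
    if not above:
        return None  # keep displaying in the preset (fallback) language
    # max() over the candidates also covers the single-candidate case.
    return max(above, key=above.get)

print(pick_language({"de": 0.81, "fr": 0.12, "it": 0.05}))  # de
print(pick_language({"de": 0.61, "fr": 0.72, "it": 0.55}))  # fr
```

Returning `None` models the case where decoding simply continues in the preset language until the classifier becomes confident.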
  • the on-screen display module is specifically used to decode the speech data to be recognized and display the decoded speech data in a preset language before the language corresponding to the speech data is determined; and, after the language corresponding to the speech data is determined, to decode the speech data using the mixed bilingual model corresponding to the determined language and replace the displayed result with the determined language.
  • the multilingual acoustic model includes a preset language model, and when the on-screen display module performs display in the preset language, a third output result is obtained by inputting the first output result into the hidden layer of the preset language model, where the preset language model is located at the output layer of the shared hidden layer;
  • the third output result is decoded to obtain recognized speech information, which is displayed in the preset language.
  • the high-level hidden layers of the multiple mixed bilingual models independently form mixed output layers, and when the on-screen display module performs the replacement display in the determined language, after the language corresponding to the speech data to be recognized is determined, the displayed speech information is replaced and displayed in the determined language using the mixed bilingual model corresponding to that language.
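The two-stage on-screen behavior described above (stream results in the preset language first, then replace them once the language is determined) might be sketched as follows; the class and method names are hypothetical:

```python
# Hypothetical sketch of the two-stage on-screen display: stream partial
# results in the preset (fallback) language, then replace everything shown
# once the classifier has determined the language.

class ScreenDisplay:
    def __init__(self, preset_lang="en"):
        self.preset_lang = preset_lang
        self.lines = []           # what the user currently sees
        self.determined = None    # language decided by the classifier

    def stream_partial(self, text):
        """Before the language is determined, display in the preset language."""
        if self.determined is None:
            self.lines.append((self.preset_lang, text))

    def replace_with(self, lang, decoded_lines):
        """After determination, redecode with the matching mixed bilingual
        model and replace everything shown so far."""
        self.determined = lang
        self.lines = [(lang, t) for t in decoded_lines]

screen = ScreenDisplay()
screen.stream_partial("navigiere...")   # shown via the fallback first
screen.replace_with("de", ["navigiere zum Alexanderplatz"])
print(screen.lines)  # [('de', 'navigiere zum Alexanderplatz')]
```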
  • the multilingual acoustic model is established based on a neural network, and the device may also include the following modules:
  • the multilingual acoustic model generation module is used to generate a multilingual acoustic model based on the fusion of shared hidden layers of multiple mixed bilingual models.
  • the multilingual acoustic model generation module is specifically used to divide the hidden layers of the multiple mixed bilingual models into bottom hidden layers and high-level hidden layers according to a preset ratio, and merge the bottom hidden layers to generate a shared hidden layer;
  • the multiple mixed bilingual models are neural networks including multiple hidden layers;
  • the high-level hidden layers of the multiple mixed bilingual models have the language characteristics of the respective mixed bilingual models;
  • a hidden layer of a preset language model is added to the output layer of the shared hidden layer;
  • the output layers of the multiple high-level hidden layers with language characteristics are merged as the input layer of the preset language classification model to construct the preset language classification model;
  • the high-level hidden layers of the multiple mixed bilingual models independently form mixed output layers, so that the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layers, the preset language classification model, and the mixed output layers are used to generate the multilingual acoustic model.
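The architecture generated by the steps above might be sketched as follows. This is a NumPy toy with illustrative dimensions only: the application describes LSTM hidden layers, which are stood in for here by simple dense layers, and the 80/20 depth split, branch names, and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(dim_in, dim_out):
    # One dense layer with tanh, standing in for an LSTM hidden layer.
    w = rng.standard_normal((dim_in, dim_out)) * 0.1
    return lambda x: np.tanh(x @ w)

# Bottom hidden layers of all mixed bilingual models merged into one
# shared stack (e.g. ~80% of the depth, per the preset-ratio split).
shared = [layer(40, 64), layer(64, 64), layer(64, 64), layer(64, 64)]
# Retained high-level hidden layers, one branch per mixed bilingual model.
branches = {"en-de": layer(64, 64), "en-fr": layer(64, 64)}
# Hidden layer of the preset-language (e.g. English) model on the shared output.
preset_branch = layer(64, 64)
# Language classifier over the concatenated branch outputs.
clf = layer(64 * len(branches), len(branches))

def forward(frames):
    h = frames
    for l in shared:                                   # first output result
        h = l(h)
    seconds = {k: b(h) for k, b in branches.items()}   # second output results
    concat = np.concatenate([seconds[k] for k in sorted(seconds)], axis=-1)
    logits = clf(concat)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    conf = e / e.sum(axis=-1, keepdims=True)           # per-language confidence
    return conf, seconds, preset_branch(h)

conf, seconds, preset_out = forward(rng.standard_normal((1, 40)))
print(conf.shape, preset_out.shape)  # (1, 2) (1, 64)
```

The point of the structure is that the expensive shared stack runs once per frame, while only the small per-language branches and the thin classifier are duplicated.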
  • the present application also provides a vehicle-mounted terminal, including:
  • the vehicle-mounted terminal includes the above multilingual speech recognition apparatus, a processor, a memory, and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, the processes of the above multilingual speech recognition method are realized, achieving the same technical effects. To avoid repetition, details are not repeated here.
  • the present application also provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the various processes of the above-mentioned multilingual recognition method for speech can be realized, and the same technical effect can be achieved. In order to avoid repetition, details are not repeated here.
  • the examples of the present application may be provided as a method, an apparatus, or a computer program product. Accordingly, the examples of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present examples may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product comprising instruction means, and the instruction means implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded on a computer or other programmable data processing terminal equipment, so that a series of operation steps are executed on the computer or other programmable terminal equipment to generate computer-implemented processing, so that the instructions executed on the computer or other programmable terminal equipment provide steps for realizing the functions specified in one or more procedures of the flow chart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A multi-language recognition method and apparatus for speech, and a terminal and a storage medium. The method comprises: acquiring speech data to be subjected to recognition and a multi-language acoustic model, wherein the multi-language acoustic model is obtained by means of performing fusion on the basis of a shared hidden layer of a plurality of hybrid bilingual models (201); obtaining a confidence level for each language according to said speech data and the multi-language acoustic model (202); and determining, on the basis of the confidence level for each language, a language corresponding to said speech data (203).

Description

Multilingual speech recognition method, apparatus, terminal and storage medium
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office on January 19, 2022, with application number 202210058785.2 and entitled "Multilingual speech recognition method, apparatus, terminal and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a multilingual speech recognition method, a corresponding multilingual speech recognition apparatus, a corresponding vehicle-mounted terminal, and a computer-readable storage medium.
Background Art
With the increasing maturity of artificial-intelligence technologies, more and more smart devices have entered users' lives, and human-machine interaction has become commonplace. Voice input, as a natural and convenient mode of human-computer interaction, frees the user's hands, and most current smart devices have a speech recognition function, which improves convenience for users. At present, the speech data to be recognized may not be speech in a single language only; it may also be mixed bilingual or mixed multilingual speech. Existing mixed multilingual recognition models are mainly constructed by separately building acoustic models for each mixed bilingual pair, such as English-German or English-French, and performing language recognition based on the output scores of the multiple sets of acoustic models. This language recognition approach requires a huge amount of computation, and its language recognition efficiency is low.
Summary of the Invention
A first aspect of the present application provides a multilingual speech recognition method, including:
acquiring speech data to be recognized and a multilingual acoustic model, the multilingual acoustic model being obtained by fusing shared hidden layers of multiple mixed bilingual models;
obtaining confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model; and
determining the language corresponding to the speech data to be recognized based on the confidence levels for the languages.
A second aspect of the present application provides a multilingual recognition apparatus, including:
a multilingual acoustic model acquisition module, configured to acquire speech data to be recognized and a multilingual acoustic model, the multilingual acoustic model being obtained by fusing shared hidden layers of multiple mixed bilingual models;
a confidence generation module, configured to obtain confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model; and
a language recognition module, configured to determine the language corresponding to the speech data to be recognized based on the confidence levels for the languages.
A third aspect of the present application provides a vehicle-mounted terminal, including a processor, a memory, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of any of the above multilingual speech recognition methods.
A fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any of the above multilingual-acoustic-model-based language recognition methods or the steps of any of the above multilingual speech recognition methods.
According to the multilingual speech recognition method provided by the present application, the speech data to be recognized is recognized using a multilingual acoustic model generated by fusing the shared hidden layers of multiple mixed bilingual models, so as to obtain a confidence level for each language; the language corresponding to the speech data is then determined based on the obtained confidence levels, completing the multilingual recognition of the speech. Because multilingual speech is recognized with a model that shares a hidden layer across the mixed bilingual models, the amount of computation of a traditional multilingual recognition model is reduced, the efficiency of language recognition is improved, and the user experience is thereby improved.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present application.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments of the present application with reference to the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 is a schematic diagram of a multilingual acoustic model in the related art;
Fig. 2 is a flow chart of the steps of the multilingual speech recognition method provided by the present application;
Fig. 3 is a schematic diagram of the multilingual acoustic model provided by the present application;
Fig. 4 is a schematic diagram of an application of the multilingual acoustic model provided by the present application;
Fig. 5 is a structural block diagram of the multilingual speech recognition apparatus provided by the present application.
Detailed Description of the Embodiments
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the scope of this application to those skilled in the art.
The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "said" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", "third", and so on may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present application, "plurality" means two or more, unless otherwise specifically defined.
The present application provides a multilingual speech recognition method, a corresponding multilingual speech recognition apparatus, a corresponding vehicle-mounted terminal, and a computer-readable storage medium that overcome, or at least partially solve, the problems described in the background art.
The technical solution of the present application is described in detail below with reference to the accompanying drawings.
The speech data to be recognized may not only be speech in a single language; it may also be mixed bilingual or mixed multilingual speech. For example, in the scenario of promoting a product across a wide area such as the whole world, Asia, or Europe, a region may have many language categories (for example, more than 20 different languages), and the language families of the different languages may differ greatly, making it difficult to achieve unified modeling for language recognition. Moreover, because the countries of such a region occupy small areas and communicate with one another frequently, in addition to supporting the local language, the system also needs to recognize the POIs (Points of Interest) and command words of other major countries and regions. Considering modeling cost and user experience, the mixed multilingual recognition system to be built needs to occupy few resources and recognize quickly.
At present, for the construction of mixed multilingual recognition models, Fig. 1 shows a schematic diagram of a multilingual acoustic model in the related art. For user speech language recognition in a certain region, since acoustic modeling cannot be performed on different kinds of languages simultaneously, assume that the widely used preset language of the region is English. Based on the consideration that English is a widely used language, more than 20 sets of mixed bilingual models are typically built, such as English-German (whose neural network layers, for example N LSTM (Long Short-Term Memory) hidden layers, can output English phoneme feature vectors and German phoneme feature vectors for softmax score calculation through its mixed output layer) and English-French (whose neural network layers, for example M LSTM hidden layers, can output English phoneme feature vectors and French phoneme feature vectors for softmax score calculation through its mixed output layer). The modeling mainly performs mixed bilingual modeling of the place names, person names, and institution names of the countries in the region using the corresponding language families. General command words can also be modeled based on a widely used language such as English, to ensure that instructions can be completed in English when recognition in other languages is inaccurate, providing a fallback. In the process of using the multilingual acoustic model shown in Fig. 1, after the acoustic models of each mixed bilingual pair, such as English-German and English-French, are modeled separately to obtain multiple mixed bilingual models, the language corresponding to the speech data to be recognized can be determined based on the language scores of the speech data in the multiple mixed bilingual models.
However, this language recognition approach based on the output scores of multiple sets of acoustic models involves modeling more than 20 sets of mixed bilingual models such as English-German and English-French, which consumes a large amount of memory and places high requirements on the machines on which the models are deployed. When more than 20 sets of mixed bilingual models are built and multiple sets of mixed bilingual acoustic models are used to compute the language scores of a user's speech request, using this multilingual acoustic model requires a large amount of computation; the acoustic models then need to be shrunk while a more powerful CPU (Central Processing Unit) is used. Shrinking an acoustic model means reducing the feature vector dimensions and the number of neural network layers, and weakening the feature selection performed by each neural network layer degrades the recognition performance of the model. In addition, the score competition among the multiple language scores causes the on-screen result to change frequently, and the time consumed by the score-competition model increases the on-screen display delay. This cannot meet the requirements of low resource occupation and fast recognition, and degrades the user's on-screen experience.
Referring to Fig. 2, a flow chart of the steps of the multilingual speech recognition method provided by the present application is shown, which may specifically include the following steps:
Step 201: acquire the speech data to be recognized and a multilingual acoustic model, where the multilingual acoustic model is obtained by fusing the shared hidden layers of multiple mixed bilingual models.
In the present application, multilingual speech can be recognized by a multilingual acoustic model obtained by fusing the shared hidden layers of multiple mixed bilingual models. Based on the shared hidden layer in the model, the amount of computation of a traditional multilingual recognition model is reduced and the efficiency of language recognition is improved.
Here, multiple sets of mixed bilingual models do not all participate in the model's calculation and recognition process. Before acquiring the multilingual acoustic model obtained by fusing the shared hidden layers of multiple mixed bilingual models, the multilingual acoustic model used in the present application can be constructed.
Specifically, one of the core ideas of the multilingual acoustic model constructed in the present application is to merge the bottom hidden layers of multiple sets of mixed bilingual models into a shared hidden layer, reducing the memory consumption of the constructed multilingual acoustic model. In the process of recognizing the speech data with the mixed bilingual models, a language classification step based on a preset language classification model is added, and the output of each high-level hidden layer is cached before the language is determined, so that when the speech data is subsequently displayed on the screen in the corresponding language, it can be displayed based on the mixed bilingual model of the determined language, reducing the amount of computation of the multilingual acoustic model.
In practical applications, the multilingual acoustic model can be generated using the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layers, the preset language classification model, and the constructed mixed output layers.
Regarding the construction of the shared hidden layer: in existing multilingual acoustic models, each mixed bilingual pair, such as English-German or English-French, is modeled separately, and each mixed bilingual model is a neural network with multiple hidden layers. Among the hidden layers of a mixed bilingual model, for example, N layers may share parameter commonality with the other mixed bilingual models; in these layers, each layer of the neural network can extract feature dimensions common across the language models, such as pauses and syllable lengths that occur universally across languages. The hidden layers of a mixed bilingual model may also include hidden layers carrying the characteristics of its own languages. The hidden layers with common parameters may be called bottom hidden layers, and the hidden layers with distinct language characteristics may be called high-level hidden layers. It should be noted that each mixed bilingual pair can be constructed by mixing a preset language widely used in the region with another language, and is not limited to English-German, English-French, or other specific acoustic models.
Referring to Fig. 3, a schematic diagram of the multilingual acoustic model provided by the present application is shown. To reduce the memory consumption and the amount of computation of the multilingual acoustic model, the hidden layers of the multiple mixed bilingual models, such as English-German and English-French, can be divided according to whether they carry distinct language characteristics, usually according to a preset ratio, for example 80% bottom hidden layers and 20% high-level hidden layers in each mixed bilingual model. The bottom hidden layers are merged into a shared hidden layer, and the high-level hidden layers that clearly carry the characteristics of a specific language family are retained. The introduction of the shared bottom hidden layer improves the hardware suitability of the constructed multilingual acoustic model on the device.
In the constructed multilingual acoustic model, the output of the merged shared hidden layer serves as the input of each retained high-level hidden layer, so that the output of each high-level hidden layer can subsequently be displayed in the corresponding language through the mixed output layer of the determined language. While reducing memory consumption and computation, retaining the high-level hidden layers improves the modeling accuracy of the multilingual acoustic model.
In the constructed multilingual acoustic model, while the output of the merged shared hidden layer is used as the input of the high-level hidden layers, a hidden layer of a preset language model can also be added to the output layer of the shared hidden layer, so that the output of the shared hidden layer also serves as the input of the preset language model. In this way, before the language is determined, the speech data can be displayed on the screen in the preset language, reducing the display delay and avoiding frequent language changes of the displayed result, improving the user's on-screen experience. The preset language may be a language widely used in a certain region; for example, for recognizing user speech in European countries, English is widely used in Europe, so the hidden layer of a preset English model can be added to the output layer of the shared hidden layer, providing on-screen display in English as well as a fallback for the constructed multilingual acoustic model.
The language of the speech can be determined by introducing a preset language classification model. The high-level hidden layers of the multiple mixed bilingual models carry language features specific to their respective models. Concretely, the multiple output layers of these high-level hidden layers, each carrying the language features of its mixed bilingual model, are merged into the input layer of the preset language classification model to construct that model; when determining the language of the speech data, the corresponding language is determined from the classification confidence the preset language classification model assigns to each language.
The language features used to train the language classification model are chiefly high-level abstract features. As shown in Figure 3, these high-level abstract features are derived from the outputs of the bottom and high-level hidden layers of the multiple mixed bilingual models. Because the neural network in the classification model can directly use language features that are already highly abstract, no large feature-extraction hidden layers need to be prepended. Building the language classification model on high-level abstract features therefore reduces the number of network layers while still providing the feature dimensions the model needs: the smaller network keeps the classifier's latency and computation low, while the concatenation of cached high-level abstract features preserves its recognition accuracy.
As for the composition of the mixed output layers in the multilingual acoustic model, each independently constructed mixed output layer can be used to decode the language in the speech data to be recognized and to display it on screen in the corresponding language. In practice, to reduce the computation of the multilingual acoustic model while keeping language identification accurate, the outputs of the high-level hidden layers of the multiple mixed bilingual models are first cached; only after the language of the speech data has been determined by the preset language classification model are the cached outputs fed into the mixed output layer of the corresponding mixed bilingual model for processing. In other words, the mixed output layers independently formed from the high-level hidden layers of the multiple mixed bilingual models perform softmax only after the corresponding language has been determined by the preset language classification model.

Caching the outputs of the high-level hidden layers of the multiple mixed bilingual models ensures that no softmax is computed on those outputs before the language is determined. The softmax computation is a standard machine-learning tool that computes the proportion each value takes within a set of values; these proportions are used to score how well each candidate word matches the speech data and to select the words to display on screen.
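As a minimal illustration of the softmax computation described above, the proportion of each value in a set of scores can be computed as follows (a sketch with invented scores; the subtraction of the maximum is a standard trick for numerical stability and is not taken from the source):

```python
import math

def softmax(scores):
    """Convert raw output-layer scores into proportions that sum to 1.

    Subtracting the maximum score first keeps exp() numerically stable.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate words; the word with the
# largest proportion would be selected for on-screen display.
probs = softmax([2.0, 1.0, 0.1])
```

The proportions always sum to 1, so they can be compared directly against per-language confidence thresholds later in the pipeline.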
As shown in Figure 3, the hidden-layer output immediately before the last softmax layer of each mixed bilingual model (i.e., before the constructed mixed output layer), that is, the output of the high-level hidden layers, can be cached. Without this caching, the mixed output layer of every mixed bilingual model would have to compute a softmax. To keep the computation low, the mixed-output-layer computation of every mixed bilingual model is suspended, so no softmax is computed before the language is determined; once the language is determined, the softmax computation of the mixed output layer is restarted, but now only the mixed output layer of the mixed bilingual model corresponding to the determined language needs to compute a softmax over the speech data to be recognized.

It should be noted that the multiple mixed bilingual models are neural networks with multiple hidden layers and adopt an LSTM structure, so the multilingual acoustic model built from the shared hidden layers and high-level hidden layers of those models also adopts an LSTM structure. In an LSTM, the hidden-layer dimension for each frame of the speech data to be recognized does not grow over time; the hidden-layer dimension is fixed. When the speech data to be recognized has 20 frames and the hidden-layer dimension of the constructed multilingual acoustic model is 512, the memory occupied during language identification and on-screen display can be 20 * 20 * 512 * 4 bytes ≈ 0.78 MB, so the constructed multilingual acoustic model is suitable for cloud storage.
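The memory figure above can be reproduced with a quick back-of-the-envelope calculation. This sketch follows the 20 * 20 * 512 * 4-byte expression in the text; the reading of the two factors of 20 as frames times cached vectors per frame is an assumption made only for illustration:

```python
# Back-of-the-envelope memory estimate for the cached hidden-layer state.
# The interpretation of the two factors of 20 (frames x cached vectors
# per frame) is an assumption; the source gives only the product.
frames = 20
cached_vectors_per_frame = 20
hidden_dim = 512
bytes_per_float = 4

total_bytes = frames * cached_vectors_per_frame * hidden_dim * bytes_per_float
total_mib = total_bytes / (1024 * 1024)  # 819200 bytes, about 0.78 MiB
```

Because the hidden-layer dimension is fixed per frame, this footprint grows only linearly with the number of frames, which is what makes the cached-state approach cheap enough for a cloud deployment.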
Step 202: obtain a confidence for each language based on the speech data to be recognized and the multilingual acoustic model.
After obtaining the multilingual acoustic model, which is a neural network built from the multiple hidden layers of the multiple mixed bilingual models and, specifically, from the fusion of their shared hidden layers, the model can be used to recognize the speech data to be recognized and produce a confidence for each language, so that the language of the speech data can subsequently be determined from these confidences.

Specifically, the hidden layers of the multiple mixed bilingual models are divided, at a preset ratio, into bottom hidden layers and high-level hidden layers, and the bottom hidden layers are merged to form the shared hidden layers used to build the multilingual acoustic model. The speech data to be recognized is first passed through the shared hidden layers of the multiple mixed bilingual models to obtain a first output that carries no pronounced language features; this first output is then fed into the high-level hidden layers of the multiple mixed bilingual models to obtain second outputs that do carry pronounced language features. The multiple second outputs are then used as the input of the preset language classification model to obtain multiple confidences, one per language.
The speech data to be recognized is input into the shared hidden layers of the multiple mixed bilingual models to obtain the first output. The shared hidden layers are those whose parameters are common across the different mixed bilingual models, for example layers that capture features shared by all languages such as pauses and syllable duration; the resulting first output carries no pronounced language features and cannot yet be used to decide the language.

The high-level hidden layers are those that carry pronounced language features in each mixed bilingual model, so the multiple second outputs they produce each carry the language features of a specific language family, and these outputs can be used to decide the language. The outputs of the high-level hidden layers, i.e., the multiple second outputs, can be cached for the time being, ensuring that no softmax is computed on them before the language is determined: the mixed-output-layer computation of every mixed bilingual model is suspended, and no softmax is performed until the language is known. When the softmax computation of the mixed output layer is later restarted, only the mixed output layer of the mixed bilingual model corresponding to the determined language needs to compute a softmax over the speech data to be recognized, which reduces the model's computation.
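The cache-then-softmax flow described above can be sketched as follows. This is a hypothetical illustration: toy scalar-weighted branches stand in for the shared and high-level hidden layers, and no real acoustic model is involved:

```python
import math

class DeferredSoftmaxModel:
    """Toy sketch of the deferred-softmax flow: the shared layers run
    once, every high-level branch runs and its output is cached, and a
    softmax is computed only for the branch of the language decided
    later."""

    def __init__(self, branches):
        # branches: dict mapping language name -> per-branch weight
        # (a stand-in for the language-specific high-level layers).
        self.branches = branches
        self.cache = {}

    def shared_forward(self, frame):
        # Stand-in for the shared hidden layers (no language features).
        return [x * 0.5 for x in frame]

    def branch_forward(self, frame):
        first_output = self.shared_forward(frame)
        # Run every high-level branch, but cache instead of softmax-ing.
        for lang, w in self.branches.items():
            self.cache[lang] = [x * w for x in first_output]

    def finalize(self, language):
        # Softmax only the cached branch of the determined language.
        scores = self.cache[language]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        t = sum(exps)
        return [e / t for e in exps]

model = DeferredSoftmaxModel({"fr": 1.0, "de": 2.0})
model.branch_forward([1.0, 2.0, 3.0])
probs = model.finalize("fr")
```

The point of the structure is that `finalize` runs once for one branch, instead of every branch paying the softmax cost on every frame.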
The cached second outputs are softmax-computed only once the language has been identified. Meanwhile, the hidden layer of the introduced preset-language model, for example an English model, guarantees on-screen display during the period when no softmax is being computed and real-time display would otherwise be impossible. Concretely, before the language result is determined, the softmax of the preset English model is computed and its result is used to display the speech data to be recognized on screen, which shortens the user's wait for on-screen text while the language is being determined and prevents the absence of real-time display, caused by skipping softmax before the language classification result is available, from degrading the experience.

To determine the language of the speech data to be recognized, a preset language classification model can be introduced. The outputs of the high-level hidden layers carry pronounced language features specific to particular language families. Concretely, the multiple output layers of the high-level hidden layers of the multiple mixed bilingual models serve as the input layer for training the preset language classification model; when the language of the speech data is later determined, the corresponding language is chosen from the classification confidence the model assigns to each language.
Specifically, the high-level hidden layers of each mixed bilingual model carry the language features of their respective model, so their outputs are strongly colored by a particular language. The multi-dimensional feature vectors representing the second outputs are concatenated along the corresponding dimension; as shown in Figure 3, the feature vectors of the individual languages, for example the German hidden features and the French hidden features, are concatenated, and the concatenated language features serve as the input layer of the preset language classification model. The constructed preset language classification model may have M conformer convolution layers, which compute the language softmax scores to obtain a confidence for each language. The concatenated language features are high-level abstract features; exploiting the language differentiation between them, multiple confidences for the different languages can be output within a very short time without having to recognize the complete audio of the speech request. The output confidences can be used to decide the language in real time, ensuring both the real-time behavior of the speech recognition system when making mixed-model decisions and the accuracy of language classification.
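The feature concatenation feeding the language classifier can be sketched as follows. The dimensions are hypothetical, and a simple linear scoring head stands in for the conformer layers mentioned above:

```python
import math

def concat_features(branch_features):
    """Concatenate per-language hidden features along the feature
    dimension, e.g. German hidden features + French hidden features."""
    out = []
    for feats in branch_features:
        out.extend(feats)
    return out

def language_confidences(concatenated, weight_rows):
    """Toy stand-in for the classifier head: one linear score per
    language, normalized by softmax into per-language confidences."""
    scores = [sum(w * x for w, x in zip(row, concatenated))
              for row in weight_rows]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    t = sum(exps)
    return [e / t for e in exps]

# Hypothetical 2-dim hidden features for two branches (German, French).
features = concat_features([[0.2, 0.4], [0.9, 0.1]])
conf = language_confidences(features, [[1, 0, 0, 0], [0, 0, 1, 0]])
```

Because the inputs are already abstract per-language features, the classifier head can stay small, which is what keeps its latency and computation low.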
It should be noted that the language features used to train the language classification model are chiefly high-level abstract features. As shown in Figure 3, these features come from the outputs of the bottom and high-level hidden layers of the multiple mixed bilingual models and belong to the computation cache. Because the neural network in the classification model can directly use language features that are already highly abstract, no large feature-extraction hidden layers need to be prepended. The language classification model can therefore be built with fewer network layers while still receiving the feature dimensions it needs: the smaller network keeps latency and computation low, while the concatenation of high-level abstract features preserves a high recognition accuracy.
Step 203: determine the language of the speech data to be recognized based on the confidence for each language.
In practice, the speech data to be recognized can be decoded in real time, and the words of consecutive frames of a preset length obtained by real-time decoding are fed into the language classification model to determine the language of the real-time speech segment. Specifically, following a language classification decision design that puts user experience first, the words of the consecutive preset-length frames obtained by real-time decoding are input into the language classification model; the model's language softmax computes a confidence of those words for each language, and the language of the speech data to be recognized is determined from those confidences.

The confidence for each language represents how likely the speech data to be recognized is to belong to that language. When determining the language of the speech data, the preset language classification model determines the corresponding language from the multiple per-language confidences. Specifically, the real-time language result for the input words is determined by comparing the confidences against a preset value.

For each word in the input consecutive preset-length frames, regardless of whether decoding of the speech data to be recognized has already timed out (i.e., exceeded the word count): in one case, if exactly one confidence is greater than the preset value, that is, some confidence exceeds the confidence threshold of a particular language, the language corresponding to that confidence is determined to be the language of the speech data to be recognized; in another case, if two or more confidences are greater than the preset value, the language with the largest confidence is determined to be the language of the speech data to be recognized; in yet another case, if none of the confidences reaches the preset value, the language with the largest confidence is taken as the language of the speech data to be recognized. The preset value may be a confidence threshold per language, which is not limited in this embodiment of the present invention.
For example, suppose language identification over frames 2 to 5 of the speech data to be recognized is fast and accurate, and decoding the speech data into words has not yet timed out, i.e., no more than 5 words have been produced. In that case, as long as the language classification confidence of each word in the consecutive preset-length frames, for example 5 consecutive frames, for some language (i.e., the maximum score of the 21-dimensional softmax) exceeds the confidence threshold of 0.8, language identification of the speech data to be recognized is finished; otherwise the language of the most recent 5 consecutive frames of the speech data must continue to be judged. If decoding of the speech data to be recognized has already timed out, i.e., more than 5 words have been produced, and the confidence of each word in the consecutive preset-length frames, for example five consecutive frames, has not reached the confidence threshold of any language, i.e., the confidence criterion is not met, the language with the highest confidence over the most recent 5 frames of speech data is taken as the final language classification result.
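The three-branch decision rule described above can be sketched as follows (a hypothetical implementation; the per-language thresholds and confidence values are illustrative, not taken from the source):

```python
def decide_language(confidences, thresholds):
    """Pick a language from per-language confidences.

    - exactly one confidence above its threshold -> that language;
    - two or more above -> the language with the largest confidence;
    - none above -> still the language with the largest confidence.
    """
    above = [lang for lang, c in confidences.items()
             if c > thresholds[lang]]
    if len(above) == 1:
        return above[0]
    # Both the "two or more" and the "none" cases fall back to the
    # language with the maximum confidence.
    return max(confidences, key=confidences.get)

# Illustrative values: only English clears its 0.8 threshold.
conf = {"en": 0.85, "fr": 0.30, "de": 0.25}
thr = {"en": 0.8, "fr": 0.8, "de": 0.8}
lang = decide_language(conf, thr)
```

Collapsing the "two or more" and "none" branches into a single argmax fallback keeps the rule total: a language result is always produced, which matters for the timeout case described above.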
In this application, the speech data to be recognized can also be decoded, and the decoded speech data displayed in real time.

Specifically, before the language of the speech data to be recognized is determined, the speech data is decoded and the decoded speech data is displayed in the preset language; after the language is determined, the mixed bilingual model corresponding to the determined language decodes the speech data to be recognized, and the decoded speech data continues to be displayed, with the preset-language text replaced by text in the determined language.
In a concrete implementation, for the on-screen display before the language is determined: after the first output of step 202 is fed into the high-level hidden layers of the multiple mixed bilingual models, their outputs, i.e., the multiple second outputs, are cached to ensure that no softmax is computed on them before the language is determined; that is, the mixed-output-layer computation of every mixed bilingual model is suspended and no softmax is performed until the language is known. After the first output is obtained, a hidden layer of the preset-language model can be added at the output of the shared hidden layers of the constructed multilingual acoustic model: while the output of the merged shared hidden layers, i.e., the first output, is fed into the high-level hidden layers of the multiple mixed bilingual models, it is also fed into the hidden layer of the preset-language model to obtain a third output, so that before the language of the speech data to be recognized is determined, the third output can be used to display the speech data in the preset language.

For example, as shown in Figure 3, the preset language refers to a language widely used in a given region. For recognizing the speech of users in European countries, English is widely used across Europe; a hidden layer of a preset English model can then be added at the output of the shared hidden layers, providing English on-screen display for the constructed multilingual acoustic model while also serving as a fallback.
Also, to avoid frequent changes in the on-screen results, a language replacement operation can be applied to them once the language has been determined. Specifically, as shown in Figure 3, the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer. After the language of the speech data to be recognized is determined, the mixed bilingual model corresponding to that language is used to replace the displayed speech information with text in the determined language. Concretely, based on the identified language, the cached outputs of the high-level hidden layers of the corresponding mixed bilingual model are passed through that model's mixed output layer, where a softmax is computed to produce the speech information, and the on-screen information previously displayed in English is replaced accordingly, achieving low-cost, low-latency on-screen display and improving the user experience.

In practice, the on-screen results are replaced with the corresponding language according to the softmax computation of the mixed bilingual model that matches the real-time language classification result. The speech data to be recognized may be mixed-language audio, for example English plus a local-language place name (say, English plus French). In that case, once the language has been identified, the corresponding bilingual mixed acoustic model is activated and used to compute the softmax over the previously cached output layer, recognizing each word and replacing the corresponding words in the on-screen result with the words recognized as French.
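The word-level replacement for mixed-language audio can be sketched as follows. This is a hypothetical illustration: a real system would operate on decoder alignments, whereas here the recognized words are simply position-aligned with the on-screen words, and the example words are invented:

```python
def replace_words(on_screen_words, recognized, target_lang):
    """Replace on-screen words at the positions where the activated
    bilingual model recognized the target language.

    recognized: list of (word, language) pairs position-aligned with
    the currently displayed on-screen words.
    """
    out = list(on_screen_words)  # copy; keep the original display intact
    for i, (word, lang) in enumerate(recognized):
        if lang == target_lang:
            out[i] = word
    return out

# Hypothetical English-first display of an English + French request;
# "Shanzelize" stands in for a mis-rendered French place name.
shown = ["navigate", "to", "Shanzelize"]
aligned = [("navigate", "en"), ("to", "en"), ("Champs-Elysees", "fr")]
updated = replace_words(shown, aligned, "fr")
```

Only the positions recognized as the target language change, so the English words already on screen are left untouched and the display does not flicker.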
In this application, a multilingual acoustic model generated by fusing the shared hidden layers of multiple mixed bilingual models recognizes the speech data to be recognized and produces a confidence for each language; the language of the speech data is determined from these confidences, completing the multilingual recognition of the speech. Recognizing the languages of speech with a multilingual acoustic model obtained by fusing the shared hidden layers of multiple mixed bilingual models reduces, through the shared hidden layers, the computation required by traditional multilingual recognition models, improves the efficiency of language identification, and thereby improves the user experience.
Referring to Figure 4, which shows a schematic diagram of an application of the multilingual acoustic model provided by this application, the constructed multilingual acoustic model can be applied to recognizing a user's personalized speech. The corresponding acoustic-language working mechanism can be divided into two stages: decoding and on-screen display.

Specifically, before the constructed multilingual acoustic model has determined the language, the models of the language widely used in the region, for example the English acoustic model and the English language model, decode streaming on-screen results to guarantee the user experience, while the multilingual acoustic model caches the hidden-layer outputs for language determination. After the language is determined, the English branch shown in Figure 3 performs no further computation; instead, the mixed bilingual model matching the language result computes the softmax, i.e., the softmax is computed through the mixed output layer built on the high-level cache layer of that mixed bilingual model, its result replaces the on-screen result, and the language model of the corresponding language is simultaneously invoked for normal decoding.
The user's speech recognition may work as follows: while decoding the speech data to be recognized, the user's IP (Internet Protocol) address is used to retrieve the resource information of the city where that IP address is located, together with the language information determined by the multilingual acoustic model, which improves the recognition rate of the speech data to be recognized. As shown in Figure 4, the resource refers to an additional Ngram model (an algorithm based on statistical language models) trained with place names. The general neural-network language model NNLM (Neural Network Language Model) is trained on text related to the POI (Point of Interest) place names of the entire country corresponding to the identified language, whereas the personalized city-level model is, compared with the general NNLM, trained mainly on the (small amount of) POI data of place-name text for the corresponding city. For reasons of computation and storage, the personalized city-level model is small in size, completing the construction of the personalized language model.

In this application scenario, the constructed multilingual acoustic model can be used to build the language model based on the determined language of the user's speech data to be recognized together with the user's resource information, so that the user's resource information is used comprehensively to recognize the user's language and the accuracy of language recognition is improved.
It should be noted that, for simplicity of description, the methods are expressed as series of combined actions, but those skilled in the art should know that this application is not limited by the described order of actions, because according to this application certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that what is described in the specification is preferred, and the actions involved are not necessarily required by this application.
Referring to Figure 5, which shows a structural block diagram of the multilingual speech recognition apparatus provided by this application, the apparatus may specifically include the following modules:

a multilingual acoustic model acquisition module 501, configured to acquire speech data to be recognized and a multilingual acoustic model, the multilingual acoustic model being obtained by fusing the shared hidden layers of multiple mixed bilingual models;

a confidence generation module 502, configured to obtain a confidence for each language from the speech data to be recognized and the multilingual acoustic model;

a language identification module 503, configured to determine the language of the speech data to be recognized based on the confidence for each language.
In the multilingual speech recognition apparatus, the apparatus may further include the following module:

an on-screen display module, configured to decode the speech data to be recognized and display the decoded speech data in real time.
In the multilingual speech recognition apparatus, the hidden layers of the multiple mixed bilingual models are divided, at a preset ratio, into bottom hidden layers and high-level hidden layers, the bottom hidden layers being merged to generate the shared hidden layers; the confidence generation module 502 may include the following submodules:

a first output generation submodule, configured to input the speech data to be recognized into the shared hidden layers of the multiple mixed bilingual models to obtain a first output;

a second output generation submodule, configured to input the first output into the high-level hidden layers of the multiple mixed bilingual models to obtain multiple second outputs;

a confidence generation submodule, configured to merge the multiple second outputs as the input of a preset language classification model to obtain multiple confidences, one per language.
In the multilingual speech recognition apparatus, the confidence generation submodule is specifically configured to concatenate the multi-dimensional feature vectors representing the second outputs along the corresponding dimension and use the concatenated feature vectors as the input of the preset language classification model to obtain multiple confidences for the different languages.

In the multilingual speech recognition apparatus, the language identification module 503 is specifically configured to: when exactly one confidence is greater than a preset value, determine the language corresponding to that confidence to be the language of the speech data to be recognized; or, when two or more confidences are greater than the preset value, determine the language with the largest confidence to be the language of the speech data to be recognized; or, when none of the confidences reaches the preset value, take the language with the largest confidence as the language of the speech data to be recognized.
在语音的多语种识别装置中,上屏显示模块具体用于在确定所述待识别的语音数据对应的语种之前,解码所述待识别的语音数据并对解码后的语音数据进行预设语种的显示;以及用于在确定所述待识别的语音数据对应的语种之后,采用与所确定语种对应的混合双语模型对所述待识别的语音数据进行解码,并继续对解码后的语音数据进行所确定语种的替换显示。In the voice multilingual recognition device, the upper-screen display module is specifically used for decoding the voice data to be recognized and displaying the decoded voice data in a preset language before determining the language corresponding to the voice data to be recognized; and after determining the language corresponding to the voice data to be recognized, using a mixed bilingual model corresponding to the determined language to decode the voice data to be recognized, and continue to perform replacement display of the determined language on the decoded voice data.
在语音的多语种识别装置中，所述多语种声学模型包括预设语种模型，上屏显示模块进行预设语种的显示时通过将所述第一输出结果输入预设语种模型的隐含层，得到第三输出结果；其中所述预设语种模型位于所述共享隐含层的输出层；In the multilingual speech recognition apparatus, the multilingual acoustic model includes a preset language model. When the on-screen display module displays in the preset language, the first output result is fed into the hidden layer of the preset language model to obtain a third output result, wherein the preset language model is located at the output layer of the shared hidden layer;
在确定所述待识别的语音数据对应的语种之前,解码所述第三输出结果以得到识别的语音信息,并以预设语种进行显示。Before determining the language corresponding to the voice data to be recognized, the third output result is decoded to obtain recognized voice information, and displayed in a preset language.
在语音的多语种识别装置中，所述多个混合双语模型的高层隐含层分别独立构成混合输出层，上屏显示模块进行所确定语种的替换显示时通过在确定所述待识别的语音数据对应的语种后，采用所述待识别的语音数据的语种相应的混合双语模型，对所显示的语音信息以所确定的语种进行替换显示。In the multilingual speech recognition apparatus, the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer. When the on-screen display module performs the replacement display in the determined language, after the language of the speech data to be recognized is determined, the mixed bilingual model corresponding to that language is used to replace the displayed speech information with the result in the determined language.
在语音的多语种识别装置中，所述多语种声学模型基于神经网络建立，所述装置还可以包括如下模块：In the multilingual speech recognition apparatus, the multilingual acoustic model is built on a neural network, and the apparatus may further include the following modules:
多语种声学模型生成模块,用于基于多个混合双语模型的共享隐含层融合生成多语种声学模型。The multilingual acoustic model generation module is used to generate a multilingual acoustic model based on the fusion of shared hidden layers of multiple mixed bilingual models.
其中，多语种声学模型生成模块具体用于将多个混合双语模型的隐含层按照预设比例区分为底层隐含层和高层隐含层，合并所述底层隐含层以生成共享隐含层，其中所述多个混合双语模型为包括多层隐含层的神经网络，所述多个混合双语模型的高层隐含层具有与各个混合双语模型相应的语种特征；在所述共享隐含层的输出层增加预设语种模型的隐含层；将所述具有与各个混合双语模型相应的语种特征的高层隐含层的多个输出层，合并作为预设语种分类模型的输入层构建预设语种分类模型，以及将所述多个混合双语模型的高层隐含层分别独立构成混合输出层，以采用所述共享隐含层、所述预设语种模型的隐含层、所述高层隐含层、所述预设语种分类模型以及所述混合输出层，生成多语种声学模型。Specifically, the multilingual acoustic model generation module is configured to divide the hidden layers of the multiple mixed bilingual models into bottom hidden layers and high-level hidden layers according to a preset ratio, and merge the bottom hidden layers to generate a shared hidden layer, wherein the multiple mixed bilingual models are neural networks comprising multiple hidden layers and the high-level hidden layer of each mixed bilingual model carries the language characteristics of that model; to add the hidden layer of a preset language model at the output layer of the shared hidden layer; to merge the multiple output layers of the high-level hidden layers carrying the language characteristics of the respective mixed bilingual models into the input layer of a preset language classification model so as to construct the preset language classification model; and to have the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer, so that the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layers, the preset language classification model, and the mixed output layers are used to generate the multilingual acoustic model.
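The assembly steps performed by the generation module can be outlined structurally. The sketch below represents layers only by name strings and assumes a 50/50 split ratio, so it illustrates the wiring between components rather than any concrete network:

```python
def split_layers(layers, ratio=0.5):
    """Divide one model's hidden layers into bottom and high-level
    parts according to a preset ratio."""
    cut = int(len(layers) * ratio)
    return layers[:cut], layers[cut:]

def build_multilingual_model(bilingual_models, ratio=0.5):
    """Assemble the multilingual acoustic model skeleton: merge the
    bottom layers into a shared hidden layer, keep each model's
    high-level (language-specific) layers as independent mixed output
    branches, attach a preset language model at the shared layer's
    output, and route all high-level branches into one classifier input."""
    shared_hidden = None
    high_level = {}
    for name, layers in bilingual_models.items():
        bottom, top = split_layers(layers, ratio)
        # "Merging" is represented here by keeping one shared copy.
        if shared_hidden is None:
            shared_hidden = bottom
        high_level[name] = top
    return {
        "shared_hidden": shared_hidden,
        "preset_language_model": ["preset_language_hidden"],
        "high_level": high_level,                 # per-model mixed output layers
        "classifier_inputs": sorted(high_level),  # branches feeding the classifier
    }
```

For example, with two four-layer bilingual models and a ratio of 0.5, the first two layers of each model form the shared hidden layer while the last two remain language-specific.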
本申请还提供了一种车载终端,包括:The present application also provides a vehicle-mounted terminal, including:
包括上述语音的多语种识别装置、处理器、存储器及存储在所述存储器上并能够在所述处理器上运行的计算机程序，该计算机程序被处理器执行时实现上述基于语音的多语种识别方法的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。本申请还提供了一种计算机可读存储介质，计算机可读存储介质上存储计算机程序，计算机程序被处理器执行时实现上述语音的多语种识别方法的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。The vehicle-mounted terminal includes the above multilingual speech recognition apparatus, a processor, a memory, and a computer program stored on the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above multilingual speech recognition method and achieves the same technical effects; to avoid repetition, details are not repeated here. The present application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program likewise implements each process of the above multilingual speech recognition method and achieves the same technical effects; to avoid repetition, details are not repeated here.
本说明书中的各个示例均采用递进的方式描述,每个示例重点说明的都是与其他示例的不同之处,各个示例之间相同相似的部分互相参见即可。Each example in this specification is described in a progressive manner, each example focuses on the difference from other examples, and the same and similar parts of each example can be referred to each other.
本领域内的技术人员应明白，本申请示例的可提供为方法、装置、或计算机程序产品。因此，本申请示例可采用完全硬件、完全软件、或结合软件和硬件方面的形式。而且，本申请示例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。Those skilled in the art should understand that the examples of the present application may be provided as a method, an apparatus, or a computer program product. Accordingly, the examples of the present application may take the form of an entirely hardware example, an entirely software example, or an example combining software and hardware aspects. Furthermore, the examples of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
本申请示例是参照根据本申请示例的方法、终端设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理终端设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。Examples of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to examples of the present application. It should be understood that each procedure and/or block in the flowchart and/or block diagram, and a combination of procedures and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions can be provided to a general-purpose computer, a special-purpose computer, an embedded processor or a processor of other programmable data processing terminal equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal equipment produce an apparatus for realizing the functions specified in one or more procedures of the flow chart and/or one or more blocks of the block diagram.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理终端设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product comprising instruction means, and the instruction means implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
这些计算机程序指令也可装载到计算机或其他可编程数据处理终端设备上,使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程终端设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing terminal equipment, so that a series of operation steps are executed on the computer or other programmable terminal equipment to generate computer-implemented processing, so that the instructions executed on the computer or other programmable terminal equipment provide steps for realizing the functions specified in one or more procedures of the flow chart and/or one or more blocks of the block diagram.
尽管已描述了本申请的优选示例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些示例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选示例以及落入本申请示例范围的所有变更和修改。While preferred examples of the present application have been described, additional alterations and modifications to these examples can be made by those skilled in the art once the basic inventive concepts are appreciated. Therefore, the appended claims are intended to be interpreted to cover the preferred examples and all changes and modifications that fall within the scope of the examples of the application.
最后，还需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语"包括"、"包含"或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下，由语句"包括一个……"限定的要素，并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。Finally, it should also be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprises", "includes", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device comprising said element.
以上已经描述了本申请的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。Having described various embodiments of the present application above, the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and alterations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principle of each embodiment, practical application or improvement of technology in the market, or to enable other ordinary skilled in the art to understand each embodiment disclosed herein.

Claims (12)

  1. 一种语音的多语种识别方法,其特征在于,所述方法包括:A multilingual recognition method of speech, characterized in that the method comprises:
    获取待识别的语音数据和多语种声学模型;所述多语种声学模型基于多个混合双语模型的共享隐含层融合得到;Obtain speech data to be recognized and a multilingual acoustic model; the multilingual acoustic model is obtained based on the fusion of shared hidden layers of multiple mixed bilingual models;
    根据所述待识别的语音数据和所述多语种声学模型,得到针对各语种的置信度;Obtaining confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model;
    基于所述针对各语种的置信度确定所述待识别的语音数据对应的语种。The language corresponding to the voice data to be recognized is determined based on the confidence for each language.
  2. 根据权利要求1所述的方法,其特征在于,还包括:The method according to claim 1, further comprising:
    解码所述待识别的语音数据,对解码后的语音数据进行实时显示。Decoding the speech data to be recognized, and displaying the decoded speech data in real time.
  3. 根据权利要求2所述的方法,其特征在于,所述多个混合双语模型的隐含层包括按照预设比例区分为底层隐含层和高层隐含层,所述底层隐含层用于合并生成共享隐含层;The method according to claim 2, wherein the hidden layers of the plurality of mixed bilingual models include a bottom layer hidden layer and a high layer hidden layer according to a preset ratio, and the bottom layer hidden layer is used for merging to generate a shared hidden layer;
    所述根据所述待识别的语音数据和所述多语种声学模型,得到针对各语种的置信度,包括:According to the speech data to be recognized and the multilingual acoustic model, the confidence for each language is obtained, including:
    将待识别的语音数据输入多个混合双语模型的共享隐含层,得到第一输出结果;Inputting the voice data to be recognized into the shared hidden layers of multiple mixed bilingual models to obtain the first output result;
    将所述第一输出结果分别输入所述多个混合双语模型的高层隐含层,得到多个第二输出结果;Inputting the first output results into the high-level hidden layers of the multiple mixed bilingual models respectively to obtain multiple second output results;
    将所述多个第二输出结果合并作为预设语种分类模型的输入项,得到针对各语种的多个置信度。The multiple second output results are combined as input items of the preset language classification model to obtain multiple confidence levels for each language.
  4. 根据权利要求3所述的方法,其特征在于,所述将所述多个第二输出结果合并作为预设语种分类模型的输入项,得到针对各语种的多个置信度,包括:The method according to claim 3, wherein said combining said plurality of second output results as an input item of a preset language classification model to obtain a plurality of confidence levels for each language includes:
    将用于表征第二输出结果的多维特征向量按照相应维度拼接,并将拼接后的特征向量作为所述预设语种分类模型的输入项,得到针对不同语种的多个置信度。The multi-dimensional feature vectors used to characterize the second output result are spliced according to corresponding dimensions, and the spliced feature vectors are used as input items of the preset language classification model to obtain multiple confidence levels for different languages.
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述基于所述针对各语种的置信度确定所述待识别的语音数据对应的语种,包括:The method according to any one of claims 1 to 4, wherein the determining the language corresponding to the speech data to be recognized based on the confidence for each language includes:
    若有且仅有一个所述置信度大于预设值，则确定该置信度对应的语种为所述待识别语音数据对应的语种；If there is one and only one confidence level greater than a preset value, the language corresponding to that confidence level is determined as the language of the speech data to be recognized;
    或,若存在两个或两个以上置信度大于预设值,则确定所述置信度值最大的对应的语种为待识别语音数据对应的语种;Or, if there are two or more confidence values greater than the preset value, then determine that the corresponding language with the largest confidence value is the language corresponding to the voice data to be recognized;
    或,若所述多个置信度均未达到预设值,则将所述置信度值最大的对应的语种为所述待识别语音数据的语种。Or, if none of the multiple confidence levels reaches the preset value, the corresponding language with the largest confidence level is the language of the voice data to be recognized.
  6. 根据权利要求3所述的方法,其特征在于,所述对解码后的语音数据进行实时显示,包括:The method according to claim 3, wherein said displaying the decoded voice data in real time comprises:
    在确定所述待识别的语音数据对应的语种之前,解码所述待识别的语音数据并对解码后的语音数据进行预设语种的显示;Before determining the language corresponding to the speech data to be recognized, decoding the speech data to be recognized and displaying the decoded speech data in a preset language;
    在确定所述待识别的语音数据对应的语种之后,采用与所确定语种对应的混合双语模型对所述待识别的语音数据进行解码,并继续对解码后的语音数据进行所确定语种的替换显示。After determining the language corresponding to the speech data to be recognized, the speech data to be recognized is decoded by using a mixed bilingual model corresponding to the determined language, and the decoded speech data is continued to be replaced with the determined language.
  7. 根据权利要求6所述的方法,其特征在于,所述多语种声学模型包括预设语种模型,所述在确定所述待识别的语音数据对应的语种之前,解析所述待识别的语音数据并对解析后的语音数据进行预设语种的显示,包括:The method according to claim 6, wherein the multilingual acoustic model includes a preset language model, and before determining the language corresponding to the speech data to be recognized, parsing the speech data to be recognized and displaying the parsed speech data in a preset language includes:
    将所述第一输出结果输入预设语种模型的隐含层,得到第三输出结果;其中所述预设语种模型位于所述共享隐含层的输出层;Inputting the first output result into the hidden layer of the preset language model to obtain a third output result; wherein the preset language model is located at the output layer of the shared hidden layer;
    在确定所述待识别的语音数据对应的语种之前,解码所述第三输出结果以得到识别的语音信息,并以预设语种进行显示。Before determining the language corresponding to the voice data to be recognized, the third output result is decoded to obtain recognized voice information, and displayed in a preset language.
  8. 根据权利要求6所述的方法，其特征在于，所述多个混合双语模型的高层隐含层分别独立构成混合输出层，所述在确定所述待识别的语音数据对应的语种之后，采用与所确定语种对应的混合双语模型对所述待识别的语音数据进行解析，并继续对解析后的语音数据进行所确定语种的显示，包括：The method according to claim 6, wherein the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer, and wherein, after the language corresponding to the speech data to be recognized is determined, parsing the speech data to be recognized with the mixed bilingual model corresponding to the determined language and continuing to display the parsed speech data in the determined language includes:
    在确定所述待识别的语音数据对应的语种后,采用所述待识别的语音数据的语种相应的混合双语模型,对所显示的语音信息以所确定的语种进行替换显示。After the language corresponding to the voice data to be recognized is determined, the displayed voice information is replaced and displayed in the determined language by using a mixed bilingual model corresponding to the language of the voice data to be recognized.
  9. 根据权利要求1所述的方法,其特征在于,所述多语种声学模型基于神经网络建立,还包括:The method according to claim 1, wherein the multilingual acoustic model is established based on a neural network, further comprising:
    将多个混合双语模型的隐含层按照预设比例区分为底层隐含层和高层隐含层，合并所述底层隐含层以生成共享隐含层，其中所述多个混合双语模型为包括多层隐含层的神经网络，所述多个混合双语模型的高层隐含层具有与各个混合双语模型相应的语种特征；The hidden layers of multiple mixed bilingual models are divided into bottom hidden layers and high-level hidden layers according to a preset ratio, and the bottom hidden layers are merged to generate a shared hidden layer, wherein the multiple mixed bilingual models are neural networks comprising multiple hidden layers, and the high-level hidden layers of the multiple mixed bilingual models have language characteristics corresponding to each mixed bilingual model;
    在所述共享隐含层的输出层增加预设语种模型的隐含层;Adding a hidden layer of a preset language model to the output layer of the shared hidden layer;
    将所述具有与各个混合双语模型相应的语种特征的高层隐含层的多个输出层,合并作为预设语种分类模型的输入层构建预设语种分类模型,以及将所述多个混合双语模型的高层隐含层分别独立构成混合输出层;Combining the multiple output layers of the high-level hidden layer with language characteristics corresponding to each mixed bilingual model into the input layer of the preset language classification model to construct the preset language classification model, and independently forming a mixed output layer with the high-level hidden layers of the multiple mixed bilingual models;
    采用所述共享隐含层、所述预设语种模型的隐含层、所述高层隐含层、所述预设语种分类模型以及所述混合输出层,生成多语种声学模型。A multilingual acoustic model is generated by using the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layer, the preset language classification model, and the mixed output layer.
  10. 一种语音的多语种识别装置,其特征在于,所述装置包括:A multilingual recognition device for speech, characterized in that the device includes:
    多语种声学模型获取模块,用于获取待识别的语音数据和多语种声学模型;所述多语种声学模型基于多个混合双语模型的共享隐含层融合得到;The multilingual acoustic model acquisition module is used to obtain the speech data to be recognized and the multilingual acoustic model; the multilingual acoustic model is obtained based on the fusion of shared hidden layers of multiple mixed bilingual models;
    置信度生成模块,用于根据所述待识别的语音数据和所述多语种声学模型,得到针对各语种的置信度;A confidence level generation module, configured to obtain confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model;
    语种识别模块,用于基于所述针对各语种的置信度确定所述待识别的语音数据对应的语种。The language recognition module is configured to determine the language corresponding to the speech data to be recognized based on the confidence for each language.
  11. 一种车载终端,其特征在于,包括:如权利要求10所述语音的多语种识别装置、处理器、存储器及存储在所述存储器上并能够在所述处理器上运行的计算机程序,所述计算机程序被所述处理器执行时实现如权利要求1-9中任一项所述语音的多语种识别方法的步骤。A vehicle-mounted terminal, characterized in that it comprises: a multilingual recognition device for speech as claimed in claim 10, a processor, a memory, and a computer program stored on the memory and capable of running on the processor, when the computer program is executed by the processor, the steps of the multilingual recognition method for speech as described in any one of claims 1-9 are realized.
  12. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质上存储计算机程序,所述计算机程序被处理器执行时实现如权利要求1-9中任一项所述语音的多语种识别方法的步骤。A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the multilingual voice recognition method according to any one of claims 1-9 are realized.
PCT/CN2022/140282 2022-01-19 2022-12-20 Multi-language recognition method and apparatus for speech, and terminal and storage medium WO2023138286A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210058785.2A CN114078468B (en) 2022-01-19 2022-01-19 Voice multi-language recognition method, device, terminal and storage medium
CN202210058785.2 2022-01-19

Publications (1)

Publication Number Publication Date
WO2023138286A1 (en)

Family ID: 80284692

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140282 WO2023138286A1 (en) 2022-01-19 2022-12-20 Multi-language recognition method and apparatus for speech, and terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114078468B (en)
WO (1) WO2023138286A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN116386609A (en) * 2023-04-14 2023-07-04 南通大学 Chinese-English mixed speech recognition method

Citations (7)

Publication number Priority date Publication date Assignee Title
US20120203540A1 (en) * 2011-02-08 2012-08-09 Microsoft Corporation Language segmentation of multilingual texts
CN103400577A (en) * 2013-08-01 2013-11-20 百度在线网络技术(北京)有限公司 Acoustic model building method and device for multi-language voice identification
US20170140759A1 (en) * 2015-11-13 2017-05-18 Microsoft Technology Licensing, Llc Confidence features for automated speech recognition arbitration
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium
CN110895932A (en) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multi-language voice recognition method based on language type and voice content collaborative classification
CN112185348A (en) * 2020-10-19 2021-01-05 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment
CN114078468A (en) * 2022-01-19 2022-02-22 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN108615525B (en) * 2016-12-09 2020-10-09 中国移动通信有限公司研究院 Voice recognition method and device
CN107240395B (en) * 2017-06-16 2020-04-28 百度在线网络技术(北京)有限公司 Acoustic model training method and device, computer equipment and storage medium
CN112489622B (en) * 2019-08-23 2024-03-19 中国科学院声学研究所 Multi-language continuous voice stream voice content recognition method and system
CN110930980B (en) * 2019-12-12 2022-08-05 思必驰科技股份有限公司 Acoustic recognition method and system for Chinese and English mixed voice
CN111753557B (en) * 2020-02-17 2022-12-20 昆明理工大学 Chinese-more unsupervised neural machine translation method fusing EMD minimized bilingual dictionary


Also Published As

Publication number Publication date
CN114078468B (en) 2022-05-13
CN114078468A (en) 2022-02-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22921707; Country of ref document: EP; Kind code of ref document: A1)