WO2023138286A1 - Method and apparatus for multilingual speech recognition, and terminal and storage medium
- Publication number: WO2023138286A1 (application PCT/CN2022/140282)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- language
- recognized
- model
- mixed
- bilingual
- Prior art date
Classifications
- G10L15/005 — Speech recognition: Language recognition
- G06N3/044 — Neural networks, architecture: Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks, architecture: Combinations of networks
- G06N3/08 — Neural networks: Learning methods
- G10L15/063 — Speech recognition: Training
- G10L15/16 — Speech recognition: Speech classification or search using artificial neural networks
Definitions
- the present application relates to the field of computer technology, and in particular to a multilingual voice recognition method, a corresponding multilingual voice recognition device, a corresponding vehicle-mounted terminal and a computer-readable storage medium.
- the speech data to be recognized may not only be speech in a single language, but may also be bilingual mixed speech or multilingual mixed speech.
- the first aspect of the present application provides a multilingual voice recognition method, including:
- the multilingual acoustic model is obtained based on the fusion of shared hidden layers of multiple mixed bilingual models
- the language corresponding to the voice data to be recognized is determined based on the confidence for each language.
- the second aspect of the present application provides a multilingual recognition device, including:
- the multilingual acoustic model acquisition module is used to obtain the speech data to be recognized and the multilingual acoustic model; the multilingual acoustic model is obtained based on the shared hidden layer fusion of multiple mixed bilingual models;
- a confidence level generation module configured to obtain confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model
- the language recognition module is configured to determine the language corresponding to the speech data to be recognized based on the confidence for each language.
- the third aspect of the present application provides a vehicle-mounted terminal, including a processor, a memory, and a computer program stored in the memory and runnable on the processor.
- when the computer program is executed by the processor, the steps of the multilingual speech recognition method described above are implemented.
- the fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the multilingual acoustic model-based language recognition method or of the multilingual speech recognition method described above are implemented.
- in the present application, the speech data to be recognized is recognized using a multilingual acoustic model generated by fusing the shared hidden layers of multiple mixed bilingual models, confidence levels for each language are obtained, and the language corresponding to the speech data is determined from those confidence levels, completing the multilingual recognition of the speech.
- because the model contains a shared hidden layer, the amount of calculation of the traditional multilingual recognition model is reduced, the efficiency of language recognition is improved, and the user experience is improved.
- Fig. 1 is a model schematic diagram of a multilingual acoustic model in the related art
- Fig. 2 is the flow chart of the steps of the multilingual recognition method of speech provided by the present application
- Fig. 3 is a model schematic diagram of the multilingual acoustic model provided by the present application.
- Fig. 4 is a schematic diagram of the application of the multilingual acoustic model provided by the present application.
- Fig. 5 is a structural block diagram of the multilingual speech recognition device provided by the present application.
- although the terms first, second, third, and so on may be used in this application to describe various information, such information should not be limited by these terms; these terms are only used to distinguish information of the same type from one another.
- for example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information.
- a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features.
- “plurality” means two or more, unless otherwise specifically defined.
- the present application provides a multilingual speech recognition method, a corresponding multilingual speech recognition device, a corresponding vehicle-mounted terminal, and a computer-readable storage medium that overcome, or at least partially solve, the problems mentioned in the background art.
- the speech data to be recognized may not only be speech in a single language, but may also be bilingual mixed speech or multilingual mixed speech.
- for example, a speech recognition product may need to cover regions such as Asia and Europe worldwide, supporting POI (Point of Interest) retrieval and command-word recognition for major countries and regions. From the perspective of modeling cost and user experience, the multilingual recognition system established needs to occupy few resources and recognize speech quickly.
- FIG. 1 shows a schematic diagram of a multilingual acoustic model in the related art.
- the preset language widely used in this area is English.
- because English is a widely used language, more than 20 sets of mixed bilingual models can be built, such as English-German (its neural network layers, e.g. N LSTM (Long Short-Term Memory) hidden layers, produce English and German phoneme feature vectors whose softmax scores are computed through its mixed output layer) and English-French (its M LSTM hidden layers likewise output English and French phoneme feature vectors for softmax scoring through its mixed output layer). Modeling mainly uses the corresponding language system for mixed bilingual modeling of the place names, person names, and institution names of the region.
- for each group of mixed bilinguals, such as English-German and English-French, a mixed bilingual model is built, yielding multiple mixed bilingual models.
- the language corresponding to the speech data to be recognized can be determined.
- however, this language recognition method based on the output scores of multiple sets of acoustic models requires modeling more than 20 sets of mixed bilingual models such as English-German and English-French, so it either demands a more powerful CPU (Central Processing Unit) or a smaller acoustic model. Shrinking the acoustic model means reducing the feature-vector dimension and the number of neural network layers, which weakens each layer's feature selection and degrades recognition. Moreover, the score competition among the multiple groups of language scores causes frequent changes in the on-screen results and adds display delay, so the approach cannot meet the requirements of low resource usage and fast recognition, harming the user's on-screen experience.
- FIG. 2 shows a flow chart of the steps of the multilingual recognition method of speech provided by the present application, which may specifically include the following steps:
- Step 201 acquiring speech data to be recognized and a multilingual acoustic model, wherein the multilingual acoustic model is fused based on shared hidden layers of multiple mixed bilingual models;
- multilingual speech recognition can be performed by using a multilingual acoustic model based on the fusion of shared hidden layers of multiple mixed bilingual models. Based on the shared hidden layer in the model, the amount of calculation in the traditional multilingual recognition model can be reduced, and the efficiency of language recognition can be improved.
- the multilingual acoustic model used in this application can be constructed before obtaining the multilingual acoustic model based on the fusion of the shared hidden layers of multiple mixed bilingual models.
- one of the core ideas of the multilingual acoustic model constructed in the present application is to merge the bottom hidden layers of multiple groups of mixed bilingual models into a shared hidden layer, which reduces the memory consumption of the built model. In addition, during recognition of the speech data to be recognized, a language classification step is added based on a preset language classification model, and the output of each high-level hidden layer is cached until the language is determined, so that the speech data can subsequently be displayed on screen using the mixed bilingual model of the determined language, reducing the amount of calculation of the multilingual acoustic model.
- the multilingual acoustic model can be generated by using the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layer, the preset language classification model and the constructed mixed output layer.
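As a rough illustration of the fusion step described above, the sketch below merges the bottom layers of two hypothetical bilingual models into one shared stack while keeping each model's high-level layers. The layer "weights" are plain lists standing in for real LSTM parameters, element-wise averaging is only one possible merge strategy, and all names are invented for illustration; none of this is code from the patent.

```python
def merge_bottom_layers(models, num_shared):
    """Average the first `num_shared` layers across models element-wise."""
    shared = []
    for layer_idx in range(num_shared):
        layers = [m["layers"][layer_idx] for m in models]
        shared.append([sum(vals) / len(vals) for vals in zip(*layers)])
    return shared

def build_multilingual_model(models, num_shared):
    return {
        # one merged bottom stack, shared by every language pair
        "shared_layers": merge_bottom_layers(models, num_shared),
        # each model's language-specific high-level layers are retained
        "high_level": {m["name"]: m["layers"][num_shared:] for m in models},
    }

en_de = {"name": "en-de", "layers": [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]}
en_fr = {"name": "en-fr", "layers": [[3.0, 2.0], [5.0, 4.0], [20.0, 20.0]]}

model = build_multilingual_model([en_de, en_fr], num_shared=2)
print(model["shared_layers"])       # [[2.0, 2.0], [4.0, 4.0]]
print(sorted(model["high_level"]))  # ['en-de', 'en-fr']
```

With 80% of layers shared (as the example ratio below suggests), the memory cost of N language pairs approaches that of a single model plus N small heads.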
- each mixed bilingual model has a neural network with multiple hidden layers.
- besides hidden layers whose parameters are common across languages in the feature-vector dimension, such as those covering pauses and syllable lengths shared among different languages, the multi-layer hidden layers of a mixed bilingual model also include hidden layers with their own language characteristics.
- the hidden layer with parameter commonality can be called the bottom hidden layer
- the hidden layer with obvious language characteristics can be called the high-level hidden layer.
- each group of mixed bilinguals can be constructed based on the preset languages widely used in this region and other languages, and is not limited to English-German, English-French and other acoustic models.
- FIG. 3 shows a schematic diagram of the multilingual acoustic model provided by the present application.
- for multiple mixed bilingual models, such as English-German and English-French, the hidden layers they contain can be divided based on whether they have obvious language characteristics.
- for example, in each mixed bilingual model, 80% of the hidden layers are bottom hidden layers and 20% are high-level hidden layers.
- the bottom hidden layers are merged into a shared hidden layer, while the high-level hidden layers that clearly carry the characteristics of a specific language family are retained; merging the bottom hidden layers improves the hardware adaptability of the multilingual acoustic model built on the device.
- the output of the merged shared hidden layer serves as the input of each retained high-level hidden layer, so that when the high-level hidden layers produce output, the voice data can be displayed in the corresponding language through the mixed output layer of the determined language. While memory consumption and calculation load are reduced, retaining the high-level hidden layers preserves the modeling accuracy of the multilingual acoustic model.
- a hidden layer of the preset language model can also be attached to the output of the shared hidden layer, so that the shared hidden layer's output serves as the input of the preset language model. The voice data can then be displayed on screen in the preset language before the language is determined, reducing the display delay and the frequent language changes of the on-screen results, and improving the user's on-screen experience.
- the preset language refers to a language widely used in a certain region. For example, for recognizing user speech in European countries, English is widely used in Europe; the hidden layer of a preset English model can be added after the output of the shared hidden layer, so that the multilingual acoustic model can display on screen in English as a fallback.
- the method of determining the speech language can be realized by introducing a preset language classification model.
- the high-level hidden layers of multiple mixed bilingual models have language characteristics corresponding to each mixed bilingual model.
- multiple output layers of the high-level hidden layers with language characteristics corresponding to each mixed bilingual model can be combined as the input layer of the preset language classification model to build a preset language classification model.
- the language features used to train the language classification model are mainly high-level abstract features.
- these high-level abstract features come from the outputs of the bottom and high-level hidden layers of the multiple mixed bilingual models. Since the neural network in the model can directly use language features that are already high-level abstract features, there is no need to prepend a large hidden layer for feature extraction.
- the small number of neural network layers keeps the language classification model's latency and computational load low, while splicing the cached high-level abstract features preserves a high recognition effect.
- the independently formed mixed output layer can be used to decode the language in the speech data to be recognized and display the corresponding language on the screen.
- the output results of the high-level hidden layers of multiple mixed bilingual models can be cached.
- the mixed output layers formed independently by the high-level hidden layers of the multiple mixed bilingual models run their softmax only after the corresponding language has been determined based on the preset language classification model.
- the output results of the high-level hidden layers of multiple mixed bilingual models are cached to ensure that the softmax calculation is not performed on the output results of the high-level hidden layers before the language is determined.
- softmax is a common machine-learning operation that computes the proportion of each value in a set of values; the proportions are used to judge the similarity of each word in the voice data and to select the words to display on screen.
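The softmax calculation referred to throughout this description can be sketched in a few lines; this is the standard numerically stable formulation, not code taken from the patent.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into proportions that sum to 1."""
    m = max(scores)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs.index(max(probs)))  # 0 — the largest raw score wins
```

The output is a probability distribution over the candidates (here, words or language classes), so the entry with the highest proportion is the most similar match.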
- in Fig. 3, the "hidden layer output" refers to the output result of the high-level hidden layer, and "the last layer softmax" refers to the constructed mixed output layer.
- in the related art, the mixed output layer of every mixed bilingual model performs the softmax calculation; here, only the mixed output layer of the mixed bilingual model corresponding to the determined language performs the softmax calculation on the speech data to be recognized.
- multiple mixed bilingual models are neural networks including multiple layers of hidden layers, which adopt the LSTM structure.
- the multilingual acoustic model constructed based on the shared hidden layer and the high-level hidden layer in each mixed bilingual model also adopts the LSTM structure.
- the dimension of the hidden layer in the model is 512
- the multilingual acoustic model constructed is suitable for cloud storage.
- Step 202 according to the voice data to be recognized and the multilingual acoustic model, the confidence level for each language is obtained;
- the multilingual acoustic model can be used to identify the speech data to be recognized, and obtain the confidence for each language, so that the language corresponding to the speech data to be recognized can be determined based on the obtained confidence.
- the hidden layers of multiple mixed bilingual models include a bottom hidden layer and a high-level hidden layer according to a preset ratio, and the bottom hidden layer is used to merge to generate a shared hidden layer for building a multilingual acoustic model.
- specifically, the voice data to be recognized can first pass through the shared hidden layer of the multiple mixed bilingual models to obtain a first output result without obvious language characteristics; the first output result is then fed into the multiple high-level hidden layers of the multiple mixed bilingual models to obtain second output results with obvious language characteristics; finally, the multiple second output results are used as the input of the preset language classification model to obtain multiple confidence levels, one for each language.
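The two-stage pass just described (shared hidden layer once, then each language-specific branch) can be mocked as follows. The "layers" are trivial stand-in functions rather than LSTMs, and the branch weights are invented; the point is only that the shared computation runs once and its result fans out to every high-level branch.

```python
def shared_forward(frames):
    # first output result: language-agnostic features
    # (stand-in computation: mean of each frame's features)
    return [sum(f) / len(f) for f in frames]

def high_level_forward(first_output, weight):
    # second output result: language-coloured features for one branch
    return [x * weight for x in first_output]

frames = [[0.2, 0.4], [0.6, 0.8]]
first = shared_forward(frames)          # computed once, shared by all
second = {lang: high_level_forward(first, w)
          for lang, w in {"en-de": 1.5, "en-fr": 0.5}.items()}
print([round(x, 2) for x in first])     # [0.3, 0.7]
```

The `second` dict (one vector per bilingual branch) is what would be spliced and fed to the language classification model.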
- the shared hidden layer may refer to hidden layers with common parameters in different mixed bilingual models, such as hidden layers for features such as pauses and syllable lengths that are common among different languages.
- the high-level hidden layer can refer to the hidden layer with obvious language characteristics in multiple mixed bilingual models.
- they can respectively carry the language characteristics for each specific language family.
- This output result can be used for language recognition.
- the output results of the high-level hidden layers of the multiple mixed bilingual models, that is, the multiple second output results, can be temporarily cached to ensure that the softmax calculation is not performed on them before the language is determined; in other words, the mixed output layer calculation of each mixed bilingual model is suspended at this time, reducing the calculation amount of the model.
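A minimal sketch of this caching idea: high-level outputs are stored per branch, and only the branch of the eventually determined language is ever flushed (where the real system would run its softmax). The class and method names are invented for illustration.

```python
class DeferredOutputCache:
    """Cache each branch's high-level outputs; compute the expensive
    output-layer step only for the branch whose language is decided."""

    def __init__(self):
        self._cache = {}

    def store(self, lang, hidden_output):
        self._cache.setdefault(lang, []).append(hidden_output)

    def flush(self, lang):
        # in the real model, softmax would run here on the cached outputs
        return self._cache.pop(lang, [])

cache = DeferredOutputCache()
cache.store("en-de", [0.1, 0.9])
cache.store("en-fr", [0.8, 0.2])
print(len(cache.flush("en-fr")))  # 1 — only the decided branch is paid for
```

Branches that never match the decided language are simply never flushed, which is the source of the calculation savings described above.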
- the cached multiple second output results need to be softmax calculated when the recognition language is determined.
- thanks to the introduced preset language model, such as the hidden layer of the English model, on-screen display can still happen in real time even though softmax has not yet been run on the bilingual branches; otherwise, skipping softmax would prevent real-time display and hurt the experience.
- in order to determine the language of the voice data to be recognized, a preset language classification model can be introduced.
- the output results of the high-level hidden layer carry obvious language characteristics for the specific language family.
- multiple output layers of the high-level hidden layer of multiple mixed bilingual models can be used as input layers for training the preset language classification model to construct a preset language classification model, so that when determining the language of the voice data, the corresponding language can be determined through the confidence of the preset language classification model for each language classification.
- the high-level hidden layers of each mixed bilingual model have language features corresponding to that mixed bilingual model, so the output result of a high-level hidden layer carries obvious language characteristics.
- the multi-dimensional feature vector used to represent the second output result can be spliced according to the corresponding dimensions.
- an M-layer convolutional (conformer) structure can then be used to compute the language softmax score and obtain the confidence for each language.
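The splicing of the second output results into a single classifier input is plain concatenation along the feature dimension; a trivial sketch, assuming each branch yields one flat feature vector (the conformer classifier itself is not reproduced here):

```python
def splice(second_outputs):
    """Concatenate per-branch feature vectors into one classifier input."""
    spliced = []
    for vec in second_outputs:
        spliced.extend(vec)
    return spliced

# one feature vector per bilingual branch, widths may differ
x = splice([[0.1, 0.2], [0.3, 0.4], [0.5]])
print(x)       # [0.1, 0.2, 0.3, 0.4, 0.5]
print(len(x))  # 5
```

The spliced vector is what the preset language classification model consumes to produce the per-language confidence scores.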
- the spliced language features are high-level abstract features.
- multiple confidence levels for different languages can be output in a short period of time without the need to recognize the complete voice request audio.
- the output confidence levels can be used to judge real-time languages, which can ensure the real-time performance of the speech recognition system when making hybrid model decisions and the accuracy of language classification.
- as shown in Figure 3, the language features used to train the language classification model are mainly high-level abstract features derived from the outputs of the bottom and high-level hidden layers of the multiple mixed bilingual models; the small number of neural network layers keeps the language classification model's latency and computation low, while splicing the high-level abstract features keeps its recognition effect high.
- Step 203 Determine the language corresponding to the voice data to be recognized based on the confidence level for each language.
- the speech data to be recognized can be decoded in real time, and the words of continuous preset length frames obtained by real-time decoding can be input into the language classification model to determine the language result of the real-time speech segment.
- the words of continuous preset length frames obtained by real-time decoding can be input into the language classification model, and the confidence of the words of the continuous preset length frames for each language is obtained through the language softmax calculation of the language classification model, and the language corresponding to the speech data to be recognized is determined based on the confidence.
- the confidence level for each language can be used to represent the recognition possibility of the voice data to be recognized and each language, then when determining the language of the voice data, the language corresponding to the voice data to be recognized can be determined based on the multiple confidence levels of each language through the preset language classification model. Specifically, the real-time language result for the input word may be determined based on the judgment result of the confidence level and the preset value.
- the preset value may be a confidence threshold for each language, which is not limited in this embodiment of the present invention.
- typically, the language can be recognized from the 2nd to 5th frames of the speech data to be recognized, achieving fast and accurate language recognition with low latency.
- if there is no timeout (i.e., no more than 5 words have been decoded) and, for consecutive preset-length frames, the language classification confidence of each word in 5 consecutive frames for a certain language (e.g., the maximum softmax score among the 21 dimensions) exceeds the confidence threshold of 0.8, the language recognition of the speech data to be recognized ends; otherwise, judgment continues on the latest 5 consecutive frames of speech data. If decoding has produced more than 5 words and the confidence of each word in five consecutive frames has never reached the confidence threshold of any language, the language with the highest confidence score over the last 5 frames of speech data is taken as the final language classification result.
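The decision rule described above (5 consecutive words over a 0.8 confidence threshold, with a highest-score fallback once more than 5 words have been decoded) can be sketched as follows; the exact windowing and tie-breaking in the patent may differ, and the constants are the example values from the text.

```python
THRESHOLD = 0.8   # confidence threshold from the text
WINDOW = 5        # consecutive frames/words examined together
MAX_WORDS = 5     # decoding budget before falling back

def decide_language(frame_confidences):
    """frame_confidences: one {language: confidence} dict per decoded
    word/frame, in arrival order. Returns a language or None (keep going)."""
    # early exit: some language clears the threshold in a full window
    for i in range(len(frame_confidences) - WINDOW + 1):
        window = frame_confidences[i:i + WINDOW]
        for lang in window[0]:
            if all(f.get(lang, 0.0) >= THRESHOLD for f in window):
                return lang
    # timeout fallback: best-scoring language over the last WINDOW frames
    if len(frame_confidences) > MAX_WORDS:
        best = {}
        for f in frame_confidences[-WINDOW:]:
            for lang, c in f.items():
                best[lang] = max(best.get(lang, 0.0), c)
        return max(best, key=best.get)
    return None  # not enough evidence yet, keep decoding

frames = [{"fr": 0.9, "de": 0.1}] * 5
print(decide_language(frames))  # fr
```

Returning `None` models the "continue judging the latest 5 frames" branch; the caller would feed in the next decoded word and retry.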
- the voice data to be recognized can also be decoded, and the decoded voice data can be displayed in real time.
- the output results can be cached to ensure that the softmax calculation is not performed on the output results of the high-level hidden layers before the language is determined, that is, the mixed output layer calculation of each mixed bilingual model can be suspended at this time, and no softmax calculation is performed before the language is determined.
- a hidden layer of the preset language model can be added after the output of the shared hidden layer; that is, when the merged shared hidden layer's output (the first output result) is fed into the high-level hidden layers of the multiple mixed bilingual models, it can also be fed into the hidden layer of the preset language model to obtain a third output result, so that before the language corresponding to the speech data is determined, the third output result can display the speech data in the preset language.
- the preset language can refer to a language that is widely used in a certain region.
- English is a language widely used in the European region.
- the hidden layer of the preset English model can be added after the output of the shared hidden layer, so that the multilingual acoustic model can display on screen in English while providing a fallback.
- the language replacement operation can be performed on the results on the screen after the language is determined.
- the high-level hidden layers of a plurality of mixed bilingual models independently form a mixed output layer.
- a mixed bilingual model corresponding to the language of the speech data to be recognized can be used to replace and display the displayed speech information with the language corresponding to the speech data to be recognized.
- the output result of the high-level hidden layer of the mixed bilingual model corresponding to the language that is cached can be used.
- the softmax calculation can then be performed on that cached output, and the resulting speech information replaces the information previously displayed on the screen in English, achieving low time consumption and low delay of the on-screen display and improving the user experience.
- the results displayed on the upper screen can be replaced by corresponding languages according to the softmax calculation of the mixed bilingual model that matches the real-time language classification results.
- the voice data to be recognized may be mixed language audio, such as English + local language audio (assuming English + French).
- the corresponding bilingual mixed acoustic model can be activated to run the softmax calculation on the previously cached output-layer results, recognize each word, and replace the recognized French words in the on-screen result.
- the speech data to be recognized is recognized to obtain the confidence degree for each language, and the language corresponding to the speech data to be recognized is determined based on the obtained confidence degree, and the multilingual recognition of the speech is completed.
- the multilingual acoustic model based on the fusion of the shared hidden layer of multiple mixed bilingual models is used to recognize the multilingual speech. Based on the shared hidden layer in the model, the amount of calculation in the traditional multilingual recognition model is reduced, the efficiency of language recognition is improved, and the user experience is improved.
- FIG. 4 shows a schematic diagram of the application of the multilingual acoustic model provided by the present application.
- the constructed multilingual acoustic model can be applied to the scene of recognizing the user's personalized voice, and the corresponding acoustic language working mechanism can be divided into two stages of decoding and displaying on the screen.
- the streaming on-screen results can be decoded and displayed to ensure the user experience, while the output of the hidden layer is cached based on the multilingual acoustic model for language determination;
- the mixed output layer composed of multiple high-level hidden layers performs the softmax calculation, the result of which replaces the on-screen result; at the same time, the language model of the corresponding language is called for normal decoding.
- the user's speech recognition can proceed as follows: in the process of decoding the speech data to be recognized, based on the user's IP (Internet Protocol) address, the resource information of the city where that IP address is located and the language information determined by the multilingual acoustic model can be called to improve the recognition rate of the speech data to be recognized.
- the resource can refer to an additional N-gram model (an algorithm based on a statistical language model) trained on place names.
- the general neural network NNLM (Neural Network Language Model) can be obtained by training on POI (Point of Interest) place-name text covering the entire country corresponding to the recognized language.
- the personalized city-level model, compared with the general neural network NNLM, is mainly trained on the (relatively small amount of) POI place-name text of the corresponding city; in consideration of the amount of calculation and storage, the personalized city-level model is kept small, which completes the construction of the personalized language model.
- the multilingual acoustic model can be used to build a language model based on the language of the user's voice data to be recognized and the user's resource information, so as to comprehensively use the user's resource information to realize the recognition of the user's language and improve the accuracy of language recognition.
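As a rough sketch of how such a small city-level model could be combined with the general NNLM: the linear interpolation below, the weight `lam`, the backoff floor, and the model interfaces are all assumptions for illustration, not details from the source.

```python
import math

def interpolated_logprob(context, word, city_ngram, general_prob, lam=0.3):
    # Linearly interpolate a small city-level place-name n-gram model
    # (a dict keyed by (context, word) tuples) with a general NNLM
    # probability function. Unseen city entries back off to a tiny floor.
    p_city = city_ngram.get((context, word), 1e-9)
    p_general = general_prob(context, word)
    return math.log(lam * p_city + (1 - lam) * p_general)
```

A word that appears in the city's POI data gets a boosted probability relative to the general model alone, which is the intended effect of calling city-level resource information during decoding.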
- FIG. 5 shows a structural block diagram of the multilingual recognition device for speech provided by the present application, which may specifically include the following modules:
- the multilingual acoustic model acquisition module 501 is used to acquire speech data to be recognized and a multilingual acoustic model; the multilingual acoustic model is obtained based on the fusion of shared hidden layers of multiple mixed bilingual models;
- the confidence degree generation module 502 is used to obtain the confidence degree for each language according to the speech data to be recognized and the multilingual acoustic model;
- the language recognition module 503 is configured to determine the language corresponding to the voice data to be recognized based on the confidence for each language.
- the described device may also include the following modules:
- the upper screen display module is used to decode the speech data to be recognized, and display the decoded speech data in real time.
- the hidden layers of the plurality of mixed bilingual models are divided into a bottom hidden layer and a high-level hidden layer according to a preset ratio, and the bottom hidden layers are used for merging to generate a shared hidden layer;
- the confidence generation module 502 may include the following submodules:
- the first output result generation sub-module is used to input the voice data to be recognized into the shared hidden layer of multiple mixed bilingual models to obtain the first output result;
- the second output result generation sub-module is used to input the first output results into the high-level hidden layers of the multiple mixed bilingual models respectively to obtain multiple second output results;
- the confidence generation sub-module is used to combine the multiple second output results as input items of the preset language classification model to obtain multiple confidence levels for each language.
- the confidence generation submodule is specifically used to splice the multi-dimensional feature vectors used to characterize the second output results according to corresponding dimensions, and use the spliced feature vectors as input items of the preset language classification model to obtain multiple confidence levels for different languages.
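The splicing-and-classification step might look like the following sketch, with a single linear layer standing in for the preset language classification model; the weights, feature sizes, and language labels are illustrative assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_language(second_outputs, classifier_weights, languages):
    # Splice the per-model second-output feature vectors dimension-wise
    # (concatenation), then feed the spliced vector through a hypothetical
    # linear layer plus softmax to obtain one confidence per language.
    spliced = [v for vec in second_outputs for v in vec]
    logits = [sum(w * x for w, x in zip(row, spliced)) for row in classifier_weights]
    return dict(zip(languages, softmax(logits)))
```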
- the language recognition module 503 is specifically used to determine, when there is only one confidence value greater than a preset value, that the language corresponding to that confidence is the language corresponding to the voice data to be recognized; or, when there are two or more confidence values greater than the preset value, to determine that the language corresponding to the largest confidence value is the language corresponding to the voice data to be recognized.
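The decision rule of module 503 can be sketched as follows; `threshold` stands in for the preset value, and the behaviour when no confidence exceeds it is not fully specified in this excerpt, so the sketch returns `None` in that case.

```python
def pick_language(confidences, threshold=0.5):
    # One confidence above the preset threshold selects its language;
    # with two or more above it, the largest wins. With none above it,
    # no language is determined here (the source leaves this case open).
    above = {lang: c for lang, c in confidences.items() if c > threshold}
    if not above:
        return None
    return max(above, key=above.get)
```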
- the upper-screen display module is specifically used for decoding the voice data to be recognized and displaying the decoded voice data in a preset language before determining the language corresponding to the voice data to be recognized; and after determining the language corresponding to the voice data to be recognized, using a mixed bilingual model corresponding to the determined language to decode the voice data to be recognized, and continue to perform replacement display of the determined language on the decoded voice data.
- the multilingual acoustic model includes a preset language model, and when the upper screen display module displays the preset language, the third output result is obtained by inputting the first output result into the hidden layer of the preset language model; wherein the preset language model is located at the output layer of the shared hidden layer;
- the third output result is decoded to obtain recognized voice information, and displayed in a preset language.
- the high-level hidden layers of the plurality of mixed bilingual models independently form a mixed output layer, and when the upper-screen display module performs the replacement display of the determined language, after determining the language corresponding to the voice data to be recognized, the displayed voice information is replaced and displayed in the determined language by using a mixed bilingual model corresponding to the language of the voice data to be recognized.
- the multilingual acoustic model is established based on a neural network, and the device may also include the following modules:
- the multilingual acoustic model generation module is used to generate a multilingual acoustic model based on the fusion of shared hidden layers of multiple mixed bilingual models.
- the multilingual acoustic model generation module is specifically used to divide the hidden layers of multiple mixed bilingual models into a bottom hidden layer and a high-level hidden layer according to a preset ratio, and merge the bottom hidden layers to generate a shared hidden layer
- the multiple mixed bilingual models are neural networks including multiple hidden layers
- the high-level hidden layers of the multiple mixed bilingual models have language characteristics corresponding to each mixed bilingual model
- add a hidden layer of a preset language model to the output layer of the shared hidden layer
- a plurality of output layers of the high-level hidden layer of language features are merged as the input layer of the preset language classification model to construct a preset language classification model
- the high-level hidden layers of the multiple mixed bilingual models are respectively independently formed into a mixed output layer, so as to use the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layer, the preset language classification model and the mixed output layer to generate a multilingual acoustic model.
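A minimal sketch of this fusion, under stated assumptions: each hidden layer is represented as a flat parameter vector for brevity, `split_ratio` stands in for the preset ratio, and element-wise averaging stands in for the unspecified merge operation.

```python
def build_multilingual_model(bilingual_models, split_ratio=0.5):
    # Split each mixed bilingual model's hidden layers into bottom and
    # high-level layers by the preset ratio.
    n_layers = len(bilingual_models[0]["hidden_layers"])
    n_bottom = int(n_layers * split_ratio)

    # Merge the bottom hidden layers of all models into one shared stack
    # (element-wise average is an assumed merge operation).
    shared = []
    for i in range(n_bottom):
        layer_i = [m["hidden_layers"][i] for m in bilingual_models]
        shared.append([sum(vals) / len(vals) for vals in zip(*layer_i)])

    # Keep each model's high-level hidden layers separate: these retain
    # the language characteristics of the corresponding bilingual model.
    high = {m["name"]: m["hidden_layers"][n_bottom:] for m in bilingual_models}
    return {"shared_hidden": shared, "high_hidden": high}
```

The shared bottom stack is computed once per utterance regardless of how many languages are covered, which is why the text credits the shared hidden layer with reducing the calculation amount relative to running every bilingual model end to end.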
- the present application also provides a vehicle-mounted terminal, including:
- the vehicle-mounted terminal includes the above-mentioned multilingual speech recognition device, a processor, a memory, and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, it implements the processes of the above-mentioned multilingual speech recognition method and achieves the same technical effects. To avoid repetition, details are not repeated here.
- the present application also provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the various processes of the above-mentioned multilingual recognition method for speech can be realized, and the same technical effect can be achieved. In order to avoid repetition, details are not repeated here.
- the examples of the present application may be provided as a method, an apparatus, or a computer program product. Accordingly, the examples of the present application may take the form of an entirely hardware example, an entirely software example, or an example combining software and hardware aspects. Furthermore, the present examples may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
- These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product comprising instruction means, and the instruction means implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
- These computer program instructions can also be loaded on a computer or other programmable data processing terminal equipment, so that a series of operation steps are executed on the computer or other programmable terminal equipment to generate computer-implemented processing, so that the instructions executed on the computer or other programmable terminal equipment provide steps for realizing the functions specified in one or more procedures of the flow chart and/or one or more blocks of the block diagram.
Abstract
The present invention relates to a multilingual speech recognition method and apparatus, a terminal, and a storage medium. The method comprises: acquiring speech data to be recognized and a multilingual acoustic model, the multilingual acoustic model being obtained by performing fusion based on a shared hidden layer of a plurality of mixed bilingual models (201); obtaining a confidence level for each language according to said speech data and the multilingual acoustic model (202); and determining, on the basis of the confidence level for each language, a language corresponding to said speech data (203).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210058785.2 | 2022-01-19 | ||
CN202210058785.2A CN114078468B (zh) | 2022-01-19 | 2022-01-19 | Multilingual speech recognition method, apparatus, terminal and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023138286A1 true WO2023138286A1 (fr) | 2023-07-27 |
Family
ID=80284692
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/140282 WO2023138286A1 (fr) | 2022-12-20 | Multilingual speech recognition method and apparatus, terminal, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114078468B (fr) |
WO (1) | WO2023138286A1 (fr) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114078468B (zh) * | 2022-01-19 | 2022-05-13 | Guangzhou Xiaopeng Motors Technology Co., Ltd. | Multilingual speech recognition method, apparatus, terminal and storage medium |
CN116386609A (zh) * | 2023-04-14 | 2023-07-04 | Nantong University | A Chinese-English mixed speech recognition method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120203540A1 (en) * | 2011-02-08 | 2012-08-09 | Microsoft Corporation | Language segmentation of multilingual texts |
CN103400577A (zh) * | 2013-08-01 | 2013-11-20 | Baidu Online Network Technology (Beijing) Co., Ltd. | Acoustic model construction method and apparatus for multilingual speech recognition
US20170140759A1 (en) * | 2015-11-13 | 2017-05-18 | Microsoft Technology Licensing, Llc | Confidence features for automated speech recognition arbitration |
CN110634487A (zh) * | 2019-10-24 | 2019-12-31 | iFlytek Co., Ltd. | Bilingual mixed speech recognition method, apparatus, device and storage medium
CN110895932A (zh) * | 2018-08-24 | 2020-03-20 | Institute of Acoustics, Chinese Academy of Sciences | Multilingual speech recognition method based on collaborative classification of language type and speech content
CN112185348A (zh) * | 2020-10-19 | 2021-01-05 | Ping An Technology (Shenzhen) Co., Ltd. | Multilingual speech recognition method, apparatus and electronic device
CN114078468A (zh) * | 2022-01-19 | 2022-02-22 | Guangzhou Xiaopeng Motors Technology Co., Ltd. | Multilingual speech recognition method, apparatus, terminal and storage medium
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108615525B (zh) * | 2016-12-09 | 2020-10-09 | China Mobile Communications Co., Ltd. Research Institute | Speech recognition method and apparatus
CN107240395B (zh) * | 2017-06-16 | 2020-04-28 | Baidu Online Network Technology (Beijing) Co., Ltd. | Acoustic model training method and apparatus, computer device, and storage medium
CN112489622B (zh) * | 2019-08-23 | 2024-03-19 | Institute of Acoustics, Chinese Academy of Sciences | Method and system for recognizing speech content in multilingual continuous speech streams
CN110930980B (zh) * | 2019-12-12 | 2022-08-05 | AISpeech Co., Ltd. | Acoustic recognition method and system for Chinese-English mixed speech
CN111753557B (zh) * | 2020-02-17 | 2022-12-20 | Kunming University of Science and Technology | Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary
- 2022-01-19 CN CN202210058785.2A patent/CN114078468B/zh active Active
- 2022-12-20 WO PCT/CN2022/140282 patent/WO2023138286A1/fr unknown
Also Published As
Publication number | Publication date |
---|---|
CN114078468B (zh) | 2022-05-13 |
CN114078468A (zh) | 2022-02-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22921707; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |