WO2023138286A1 - Multi-language recognition method and apparatus for speech, and terminal and storage medium - Google Patents

Multi-language recognition method and apparatus for speech, and terminal and storage medium Download PDF

Info

Publication number
WO2023138286A1
WO2023138286A1 · PCT/CN2022/140282 · CN2022140282W
Authority
WO
WIPO (PCT)
Prior art keywords
language
recognized
model
mixed
bilingual
Prior art date
Application number
PCT/CN2022/140282
Other languages
French (fr)
Chinese (zh)
Inventor
张辽 (Zhang Liao)
Original Assignee
广州小鹏汽车科技有限公司 (Guangzhou Xiaopeng Motors Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州小鹏汽车科技有限公司 (Guangzhou Xiaopeng Motors Technology Co., Ltd.)
Publication of WO2023138286A1 publication Critical patent/WO2023138286A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Definitions

  • the present application relates to the field of computer technology, and in particular to a multilingual voice recognition method, a corresponding multilingual voice recognition device, a corresponding vehicle-mounted terminal and a computer-readable storage medium.
  • the speech data to be recognized may not only be speech in a single language, but may also be bilingual mixed speech or multilingual mixed speech.
  • the first aspect of the present application provides a multilingual voice recognition method, including:
  • the multilingual acoustic model is obtained based on the fusion of shared hidden layers of multiple mixed bilingual models
  • the language corresponding to the voice data to be recognized is determined based on the confidence for each language.
  • the second aspect of the present application provides a multilingual recognition device, including:
  • the multilingual acoustic model acquisition module is used to obtain the speech data to be recognized and the multilingual acoustic model; the multilingual acoustic model is obtained based on the shared hidden layer fusion of multiple mixed bilingual models;
  • a confidence level generation module configured to obtain confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model
  • the language recognition module is configured to determine the language corresponding to the speech data to be recognized based on the confidence for each language.
  • the third aspect of the present application provides a vehicle-mounted terminal, including: a processor, a memory, and a computer program stored in the memory and executable on the processor.
  • when the computer program is executed by the processor, the steps of any one of the above multilingual speech recognition methods are implemented.
  • the fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of any one of the multilingual acoustic model-based language recognition methods or of the multilingual speech recognition methods above are implemented.
  • the speech data to be recognized is recognized by using the multilingual acoustic model generated by the fusion of shared hidden layers based on multiple mixed bilingual models to obtain the confidence for each language, and the language corresponding to the speech data to be recognized is determined based on the obtained confidence to complete the multilingual recognition of the speech.
  • the multilingual acoustic model based on the fusion of the shared hidden layers of multiple mixed bilingual models is used to recognize the multilingual speech. Based on the shared hidden layer in the model, the amount of calculation in the traditional multilingual recognition model is reduced, the efficiency of language recognition is improved, and the user experience is improved.
  • Fig. 1 is a model schematic diagram of a multilingual acoustic model in the related art
  • Fig. 2 is the flow chart of the steps of the multilingual recognition method of speech provided by the present application
  • Fig. 3 is a model schematic diagram of the multilingual acoustic model provided by the present application.
  • Fig. 4 is a schematic diagram of the application of the multilingual acoustic model provided by the present application.
  • Fig. 5 is a structural block diagram of the multilingual speech recognition device provided by the present application.
  • first, second, third and so on may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another.
  • first information may also be called second information, and similarly, second information may also be called first information.
  • a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features.
  • “plurality” means two or more, unless otherwise specifically defined.
  • the present application provides a multilingual voice recognition method, a corresponding multilingual voice recognition device, a corresponding vehicle-mounted terminal, and a computer-readable storage medium that overcome the problems mentioned in the background technology or at least partially solve the problems mentioned in the background technology.
  • the speech data to be recognized may not only be speech in a single language, but may also be bilingual mixed speech or multilingual mixed speech.
  • in regions around the world, such as Asia and Europe, the system needs to recognize POI (Point of Interest) names as well as command words for the other major countries and regions. From the perspective of modeling cost and user experience, the multi-language recognition system to be established needs to occupy few resources and recognize speech quickly.
  • FIG. 1 a schematic diagram of a multilingual acoustic model in the related art is shown.
  • the preset language widely used in this area is English.
  • English is a widely used language, so mixed bilingual models are built between English and each other language. For example, in the English-German model, its neural network layers, such as an N-layer LSTM (Long Short-Term Memory) hidden stack, produce English and German phoneme feature vectors whose softmax scores are computed by its mixed output layer; the English-French model likewise uses an M-layer LSTM hidden stack and its own mixed output layer. More than 20 such sets of mixed bilingual models are built, mainly to model the place names, person names, and institution names of each region with the corresponding language system.
  • by modeling each bilingual pair in this way, such as English-German and English-French, multiple mixed bilingual models are obtained.
  • the language corresponding to the speech data to be recognized can be determined.
  • this language recognition method based on the output scores of multiple sets of acoustic models involves modeling more than 20 sets of mixed bilingual models such as English-German and English-French, so it either demands a CPU (Central Processing Unit) with stronger performance or a reduction in acoustic model size. Reducing the model size means reducing the feature-vector dimension and the number of neural network layers, which weakens each layer's feature selection and degrades recognition. In addition, the score comparison ("score PK") across multiple groups of language scores causes frequent changes in the on-screen results and increases display latency, so the approach cannot meet the requirements of low resource occupation and fast recognition, harming the user's on-screen experience.
  • FIG. 2 shows a flow chart of the steps of the multilingual recognition method of speech provided by the present application, which may specifically include the following steps:
  • Step 201: acquire speech data to be recognized and a multilingual acoustic model, wherein the multilingual acoustic model is obtained by fusing the shared hidden layers of multiple mixed bilingual models;
  • multilingual speech recognition can be performed by using a multilingual acoustic model based on the fusion of shared hidden layers of multiple mixed bilingual models. Based on the shared hidden layer in the model, the amount of calculation in the traditional multilingual recognition model can be reduced, and the efficiency of language recognition can be improved.
  • the multilingual acoustic model used in this application can be constructed before obtaining the multilingual acoustic model based on the fusion of the shared hidden layers of multiple mixed bilingual models.
  • one of the core ideas of the multilingual acoustic model constructed in the present application is to merge the bottom hidden layers of multiple groups of mixed bilingual models into a shared hidden layer, reducing the memory consumption of the model. In addition, in the process of recognizing the speech data, a language classification step based on a preset language classification model is added, and the output of each high-level hidden layer is cached until the language is determined, so that the speech can subsequently be displayed on screen with the mixed bilingual model of the determined language, reducing the amount of calculation of the multilingual acoustic model.
  • the multilingual acoustic model can be generated by using the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layer, the preset language classification model and the constructed mixed output layer.
  • each mixed bilingual model has a neural network with multiple hidden layers.
  • some hidden layers share parameters across languages in the feature dimension, for example for pauses and syllable lengths that are common among different languages; the multi-layer hidden stack of a mixed bilingual model also includes hidden layers carrying its own language characteristics.
  • the hidden layer with parameter commonality can be called the bottom hidden layer
  • the hidden layer with obvious language characteristics can be called the high-level hidden layer.
  • each group of mixed bilinguals can be constructed based on the preset languages widely used in this region and other languages, and is not limited to English-German, English-French and other acoustic models.
  • FIG. 3 it shows a schematic diagram of the multilingual acoustic model provided by the present application.
  • for multiple mixed bilingual models, such as English-German and English-French, the hidden layers they contain can be divided according to whether they carry obvious language characteristics.
  • for example, in each mixed bilingual model about 80% of the hidden layers are bottom hidden layers and 20% are high-level hidden layers.
  • the bottom hidden layers are merged into a shared hidden layer, while the high-level hidden layers that clearly carry the characteristics of a specific language family are retained; introducing the shared bottom layers improves how well the multilingual acoustic model fits the hardware of the device it runs on.
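The split-and-merge step above can be sketched in a few lines. The 80/20 ratio comes from the text; merging by parameter averaging is an illustrative assumption, and each "layer" is reduced to a single scalar weight:

```python
# Split each bilingual model's hidden-layer stack into bottom (shared) and
# high-level (language-specific) parts, then merge the bottom parts into
# one shared stack by averaging. Toy scalars stand in for layer parameters.

def partition_layers(layers, bottom_ratio=0.8):
    """Split a hidden-layer stack into bottom and high-level parts."""
    cut = int(len(layers) * bottom_ratio)
    return layers[:cut], layers[cut:]

def merge_bottom_layers(models):
    """Average the bottom layers of several bilingual models into a
    shared hidden-layer stack."""
    bottoms = [partition_layers(m)[0] for m in models]
    return [sum(b[i] for b in bottoms) / len(bottoms)
            for i in range(len(bottoms[0]))]

# toy 10-layer models: 8 bottom layers are merged, 2 high-level layers kept
en_de = [float(i) for i in range(10)]
en_fr = [float(i) + 1.0 for i in range(10)]

shared = merge_bottom_layers([en_de, en_fr])
print(len(shared))   # → 8
print(shared[0])     # → 0.5
```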
  • the output of the merged shared hidden layer serves as the input of each retained high-level hidden layer, so that once the language is determined the voice data can be displayed through the mixed output layer of that language. While memory consumption and calculation load are reduced, retaining the high-level hidden layers preserves the modeling accuracy of the multilingual acoustic model.
  • a hidden layer of a preset language model can also be attached to the output of the shared hidden layer, so that the output of the shared hidden layer also feeds the preset language model. The voice data can then be displayed on screen in the preset language before the language is determined, which reduces display latency and avoids frequent language changes in the on-screen results, improving the user's on-screen experience.
  • the preset language can refer to a language that is widely used in a certain region. For example, for recognizing user speech in European countries, English is widely used in Europe; a hidden layer of a preset English model can therefore be attached to the output of the shared hidden layer, allowing the multilingual acoustic model to display results on screen in English as a fallback.
  • the method of determining the speech language can be realized by introducing a preset language classification model.
  • the high-level hidden layers of multiple mixed bilingual models have language characteristics corresponding to each mixed bilingual model.
  • multiple output layers of the high-level hidden layers with language characteristics corresponding to each mixed bilingual model can be combined as the input layer of the preset language classification model to build a preset language classification model.
  • the language features used to train the language classification model mainly have high-level abstract features.
  • this high-level abstract feature is based on the output of the bottom hidden layer and high-level hidden layer of multiple mixed bilingual models. Since the neural network in the model can directly use language features that already have high-level abstract features, there is no need to prepend a large hidden layer for feature extraction.
  • the small number of neural network layers keeps the latency and computational load of the language classification model low, while splicing the cached high-level abstract features keeps the model's recognition accuracy high.
  • the independently formed mixed output layer can be used to decode the language in the speech data to be recognized and display the corresponding language on the screen.
  • the output results of the high-level hidden layers of multiple mixed bilingual models can be cached.
  • the mixed output layers of the bilingual models are handled as follows: the mixed output layer formed independently from the high-level hidden layers of each mixed bilingual model performs the softmax calculation only after the corresponding language has been determined by the preset language classification model.
  • the output results of the high-level hidden layers of multiple mixed bilingual models are cached to ensure that the softmax calculation is not performed on the output results of the high-level hidden layers before the language is determined.
  • softmax is a standard machine-learning function: it normalizes a set of values into proportions, so that the similarity of each candidate word to the voice data can be determined from the calculated proportions and words can be selected for display on screen.
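The proportion calculation just described can be sketched directly; the candidate scores below are made-up toy values:

```python
import math

def softmax(scores):
    """Normalize raw output-layer scores into proportions summing to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# toy scores for three candidate words; the resulting proportions decide
# which word is shown on screen
probs = softmax([2.0, 1.0, 0.1])
print(probs[0] > probs[1] > probs[2])  # → True
```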
  • in the figure, "hidden layer output" refers to the output result of the high-level hidden layer, and "the last layer softmax" refers to the constructed mixed output layer.
  • otherwise, the mixed output layer of every mixed bilingual model would need to perform the softmax calculation; after the language is determined, only the mixed output layer of the mixed bilingual model corresponding to that language performs the softmax calculation on the speech data to be recognized.
  • multiple mixed bilingual models are neural networks including multiple layers of hidden layers, which adopt the LSTM structure.
  • the multilingual acoustic model constructed based on the shared hidden layer and the high-level hidden layer in each mixed bilingual model also adopts the LSTM structure.
  • the dimension of the hidden layer in the model is 512
  • the multilingual acoustic model constructed is suitable for cloud storage.
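The overall structure described above, a shared bottom stack feeding language-specific high-level heads plus a preset English head for fallback display, with a classifier over the spliced high-level outputs, can be sketched with toy layers. All layer functions, biases, and shapes here are illustrative assumptions, not the patent's actual parameters:

```python
# Minimal structural sketch of the multilingual acoustic model: shared
# bottom stack -> cached high-level heads + preset English head -> small
# language classifier over the spliced high-level outputs.

def shared_hidden(frame):
    # bottom layers: language-independent features (pauses, syllable length)
    return [x * 0.5 for x in frame]

def make_high_head(bias):
    # high-level layers carrying one bilingual pair's language character
    def head(shared_out):
        return [x + bias for x in shared_out]
    return head

heads = {"en-de": make_high_head(0.1), "en-fr": make_high_head(0.2)}
preset_en_head = make_high_head(0.0)  # preset-language (English) display path

def language_classifier(spliced):
    # stand-in for the small conformer classifier: one score per pair
    n = len(spliced) // len(heads)
    return {name: sum(spliced[i * n:(i + 1) * n])
            for i, name in enumerate(heads)}

frame = [1.0, 2.0]                 # one toy acoustic frame
shared_out = shared_hidden(frame)  # first output result

# cache each high-level (second) output; softmax on them is deferred
# until the language has been determined
cache = {name: head(shared_out) for name, head in heads.items()}
english_out = preset_en_head(shared_out)  # drives the fallback display

spliced = [v for name in heads for v in cache[name]]  # feature splicing
scores = language_classifier(spliced)
print(max(scores, key=scores.get))  # → en-fr
```

The deferred-softmax cache and the English fallback path correspond to the two mechanisms the text attributes to the reduced calculation load and the low display latency.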
  • Step 202: obtain the confidence level for each language according to the voice data to be recognized and the multilingual acoustic model;
  • the multilingual acoustic model can be used to identify the speech data to be recognized, and obtain the confidence for each language, so that the language corresponding to the speech data to be recognized can be determined based on the obtained confidence.
  • the hidden layers of multiple mixed bilingual models include a bottom hidden layer and a high-level hidden layer according to a preset ratio, and the bottom hidden layer is used to merge to generate a shared hidden layer for building a multilingual acoustic model.
  • the voice data to be recognized can be passed through the shared hidden layer of the multiple mixed bilingual models to obtain a first output result without obvious language characteristics; the first output result is then fed to the multiple high-level hidden layers of the mixed bilingual models to obtain second output results with obvious language characteristics; and the multiple second output results are used as the input of the preset language classification model to obtain confidence levels for each language.
  • the shared hidden layer may refer to hidden layers with common parameters in different mixed bilingual models, such as hidden layers for features such as pauses and syllable lengths that are common among different languages.
  • the high-level hidden layer can refer to the hidden layer with obvious language characteristics in multiple mixed bilingual models.
  • they can respectively carry the language characteristics for each specific language family.
  • This output result can be used for language recognition.
  • the output results of the high-level hidden layers of the multiple mixed bilingual models, that is, the multiple second output results, can be cached so that no softmax calculation is performed on them before the language is determined; in other words, the mixed output layer calculation of each mixed bilingual model is suspended at this point, reducing the calculation amount of the model.
  • the cached multiple second output results need to be softmax calculated when the recognition language is determined.
  • the introduced preset language model, such as the hidden layer of the English model, ensures that skipping the softmax calculation of the bilingual models does not prevent real-time on-screen display, which would otherwise affect the user experience.
  • in order to determine the language of the voice data to be recognized, a preset language classification model can be introduced.
  • the output results of the high-level hidden layer carry obvious language characteristics for the specific language family.
  • multiple output layers of the high-level hidden layer of multiple mixed bilingual models can be used as input layers for training the preset language classification model to construct a preset language classification model, so that when determining the language of the voice data, the corresponding language can be determined through the confidence of the preset language classification model for each language classification.
  • the high-level hidden layers of each mixed bilingual model carry language features corresponding to that model, so their output results have an obvious language character.
  • the multi-dimensional feature vector used to represent the second output result can be spliced according to the corresponding dimensions.
  • the spliced features are fed to an M-layer convolutional conformer, which computes the language softmax scores to obtain the confidence for each language.
  • the spliced language features are high-level abstract features.
  • multiple confidence levels for different languages can be output in a short period of time without the need to recognize the complete voice request audio.
  • the output confidence levels can be used to judge real-time languages, which can ensure the real-time performance of the speech recognition system when making hybrid model decisions and the accuracy of language classification.
  • the language features used to train the language classification model are mainly high-level abstract features. As shown in Fig. 3, these features are based on the outputs of the bottom and high-level hidden layers of the multiple mixed bilingual models; with fewer neural network layers the language classification model achieves low latency and low calculation, while splicing the high-level abstract features keeps the recognition accuracy high.
  • Step 203: determine the language corresponding to the voice data to be recognized based on the confidence level for each language.
  • the speech data to be recognized can be decoded in real time, and the words of continuous preset length frames obtained by real-time decoding can be input into the language classification model to determine the language result of the real-time speech segment.
  • the words of continuous preset length frames obtained by real-time decoding can be input into the language classification model, and the confidence of the words of the continuous preset length frames for each language is obtained through the language softmax calculation of the language classification model, and the language corresponding to the speech data to be recognized is determined based on the confidence.
  • the confidence level for each language can be used to represent the recognition possibility of the voice data to be recognized and each language, then when determining the language of the voice data, the language corresponding to the voice data to be recognized can be determined based on the multiple confidence levels of each language through the preset language classification model. Specifically, the real-time language result for the input word may be determined based on the judgment result of the confidence level and the preset value.
  • the preset value may be a confidence threshold for each language, which is not limited in this embodiment of the present invention.
  • language recognition can typically be completed on the 2nd to 5th frames of the speech data to be recognized, achieving fast and accurate recognition with low latency.
  • if there is no timeout, that is, no more than 5 words have been decoded, and the language classification confidence of each word in a run of consecutive preset-length frames, for example 5 consecutive frames, for a certain language (i.e. the maximum softmax score among the 21 dimensions) exceeds the confidence threshold of 0.8, the language recognition of the speech data to be recognized is finished; otherwise the judgment continues on the most recent 5 consecutive frames of speech data. If the decoded words already exceed 5 and the confidence of each word in 5 consecutive frames has not reached the confidence threshold of any language, the language with the highest confidence score over the last 5 frames of speech data is taken as the final language classification result.
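The stopping rule just described, a confidence threshold of 0.8 over 5 consecutive frames with a highest-score fallback on timeout, can be sketched as follows. The sliding-window mechanics and the per-word confidence format are assumptions made for illustration:

```python
def decide_language(frame_confidences, threshold=0.8, window=5):
    """Decide the language from per-word confidence dicts.

    Returns (language, committed): committed is True when some run of
    `window` consecutive words all exceed `threshold` for one language,
    False when the timeout fallback (best score over the last `window`
    words) had to be used.
    """
    langs = list(frame_confidences[0])
    # look for a window of consecutive words all confident in one language
    for start in range(len(frame_confidences) - window + 1):
        win = frame_confidences[start:start + window]
        for lang in langs:
            if all(f[lang] > threshold for f in win):
                return lang, True
    # timeout: take the highest-confidence language of the last words
    last = frame_confidences[-window:]
    best = max(langs, key=lambda l: max(f[l] for f in last))
    return best, False

print(decide_language([{"de": 0.9, "fr": 0.1}] * 5))  # → ('de', True)
print(decide_language([{"de": 0.5, "fr": 0.6}] * 6))  # → ('fr', False)
```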
  • the voice data to be recognized can also be decoded, and the decoded voice data can be displayed in real time.
  • the output results can be cached to ensure that the softmax calculation is not performed on the outputs of the high-level hidden layers before the language is determined; that is, the mixed output layer calculation of each mixed bilingual model is suspended at this point.
  • a hidden layer of the preset language model can be attached to the output of the shared hidden layer: when the output of the merged shared hidden layer, i.e. the first output result, is fed into the high-level hidden layers of the multiple mixed bilingual models, it can also be fed into the hidden layer of the preset language model to obtain a third output result. Before the language corresponding to the speech data is determined, the third output result is used to display the speech data in the preset language.
  • the preset language can refer to a language that is widely used in a certain region.
  • English is a language widely used in the European region.
  • the hidden layer of the preset English model can be added to the output of the shared hidden layer, so that the multilingual acoustic model can display results on screen in English as a fallback.
  • the language replacement operation can be performed on the results on the screen after the language is determined.
  • the high-level hidden layers of a plurality of mixed bilingual models independently form a mixed output layer.
  • a mixed bilingual model corresponding to the language of the speech data to be recognized can be used to replace and display the displayed speech information with the language corresponding to the speech data to be recognized.
  • the cached output result of the high-level hidden layer of the mixed bilingual model corresponding to the determined language can be used: the softmax calculation is performed on it, and its output replaces the speech information previously displayed on screen in English, achieving low display latency and improving the user experience.
  • the results displayed on the upper screen can be replaced by corresponding languages according to the softmax calculation of the mixed bilingual model that matches the real-time language classification results.
  • the voice data to be recognized may be mixed language audio, such as English + local language audio (assuming English + French).
  • after the language is determined, the corresponding bilingual mixed acoustic model can be activated; this model performs the softmax calculation on the previously cached high-level hidden layer outputs to recognize each word, and each recognized French word replaces the corresponding word in the on-screen result.
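The two-stage display flow, English fallback first and in-place replacement once the language is known, can be sketched as below. `decode` is a toy stand-in for the softmax-based word selection of the matched mixed output layer, and all words are hypothetical examples:

```python
# Stage 1 streams the preset-language (English) decoding immediately;
# stage 2 decodes the cached high-level outputs with the matched model's
# output layer and replaces the words shown on screen.

screen = []

def stream_fallback(words_en):
    """stage 1: show the preset-language decoding right away"""
    screen.extend(words_en)

def replace_with_language(cached_outputs, decode):
    """stage 2: replace the screen using the matched model's decoder"""
    screen[:] = [decode(o) for o in cached_outputs]

stream_fallback(["navigate", "to", "paris"])       # English fallback shown
cached = ["naviguer", "vers", "paris"]             # cached per-word outputs (toy)
replace_with_language(cached, decode=lambda w: w)  # French replaces English
print(screen)  # → ['naviguer', 'vers', 'paris']
```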
  • the speech data to be recognized is recognized to obtain the confidence degree for each language, and the language corresponding to the speech data to be recognized is determined based on the obtained confidence degree, and the multilingual recognition of the speech is completed.
  • the multilingual acoustic model based on the fusion of the shared hidden layer of multiple mixed bilingual models is used to recognize the multilingual speech. Based on the shared hidden layer in the model, the amount of calculation in the traditional multilingual recognition model is reduced, the efficiency of language recognition is improved, and the user experience is improved.
  • FIG. 4 it shows a schematic diagram of the application of the multilingual acoustic model provided by the present application.
  • the constructed multilingual acoustic model can be applied to the scene of recognizing the user's personalized voice, and the corresponding acoustic language working mechanism can be divided into two stages of decoding and displaying on the screen.
  • in the decoding stage, the streamed on-screen results are decoded to ensure the user experience, while the hidden layer outputs are cached by the multilingual acoustic model for language determination;
  • in the display stage, the mixed output layer composed of the cached high-level outputs performs the softmax calculation, the result of which replaces the on-screen content, and the language model of the corresponding language is called for normal decoding.
  • during decoding of the speech data to be recognized, based on the user's IP (Internet Protocol) address, the resource information of the city corresponding to that IP address and the language information determined by the multilingual acoustic model can be called upon to improve the recognition rate of the speech data.
  • the resource can refer to an additional N-gram model (a statistical language model) trained on place names.
  • the general neural network NNLM “Nerual Network Language Model) can be obtained based on the text training related to the place name of the POI (Point of Interest) place name of the entire country corresponding to the recognized language.
  • the personalized city-level model compared with the general neural network NNLM, is mainly based on the POI data of the place name text training of the corresponding city ( A small amount), and in consideration of the amount of calculation and storage, the size of the personalized city-level model is small, and the construction of the personalized language model is completed.
  • based on the multilingual acoustic model, a language model can be selected according to the language of the user's speech data to be recognized and the user's resource information, so that the user's resource information is comprehensively used to recognize the user's language and improve the accuracy of language recognition.
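The resource-selection step above can be sketched as follows. This is a minimal illustration only: the model names, registry dictionaries, and lookup keys are hypothetical, since the application does not specify an implementation.

```python
# Hypothetical sketch: pick language-model resources for decoding based on
# the detected language and the city derived from the user's IP address.
# All model/registry names below are illustrative, not from the patent.

GENERAL_NNLM = {"de": "nnlm_de_nationwide", "fr": "nnlm_fr_nationwide"}
CITY_NGRAM = {("de", "Berlin"): "ngram_de_berlin",
              ("fr", "Paris"): "ngram_fr_paris"}

def select_lm_resources(language: str, city: str):
    """Return (general_model, city_model_or_None) used during decoding."""
    general = GENERAL_NNLM.get(language)
    if general is None:
        raise ValueError(f"unsupported language: {language}")
    # The small city-level N-gram is an optional extra resource.
    return general, CITY_NGRAM.get((language, city))

models = select_lm_resources("de", "Berlin")
print(models)  # ('nnlm_de_nationwide', 'ngram_de_berlin')
```

A city without a dedicated model simply falls back to the general nationwide NNLM alone.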
  • FIG. 5 shows a structural block diagram of the multilingual speech recognition apparatus provided by the present application, which may specifically include the following modules:
  • the multilingual acoustic model acquisition module 501 is used to acquire the speech data to be recognized and a multilingual acoustic model; the multilingual acoustic model is obtained by fusing the shared hidden layers of multiple mixed bilingual models;
  • the confidence generation module 502 is used to obtain a confidence level for each language according to the speech data to be recognized and the multilingual acoustic model;
  • the language recognition module 503 is configured to determine the language corresponding to the speech data to be recognized based on the confidence levels for the languages.
  • the device may also include the following modules:
  • the on-screen display module is used to decode the speech data to be recognized and display the decoded speech data in real time.
  • the hidden layers of the multiple mixed bilingual models are divided into bottom hidden layers and high-level hidden layers according to a preset ratio, and the bottom hidden layers are merged to generate a shared hidden layer;
  • the confidence generation module 502 may include the following submodules:
  • the first output result generation sub-module is used to input the speech data to be recognized into the shared hidden layer of the multiple mixed bilingual models to obtain a first output result;
  • the second output result generation sub-module is used to input the first output result into the high-level hidden layers of the multiple mixed bilingual models respectively to obtain multiple second output results;
  • the confidence generation sub-module is used to combine the multiple second output results as input items of the preset language classification model to obtain a confidence level for each language.
  • the confidence generation sub-module is specifically used to splice the multi-dimensional feature vectors characterizing the second output results along the corresponding dimensions, and use the spliced feature vectors as input items of the preset language classification model to obtain the confidence levels for the different languages.
  • the language recognition module 503 is specifically used to: when only one confidence value is greater than a preset value, determine that the language corresponding to that confidence value is the language corresponding to the speech data to be recognized; or, when two or more confidence values are greater than the preset value, determine that the language corresponding to the largest confidence value is the language corresponding to the speech data to be recognized.
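The decision rule of the language recognition module 503 can be sketched as follows (a minimal illustration; the threshold value and language labels are hypothetical):

```python
def pick_language(confidences: dict, threshold: float = 0.5):
    """Return the language for the speech data, or None if no
    confidence exceeds the preset value.

    - exactly one confidence > threshold: that language wins;
    - two or more > threshold: the largest confidence wins.
    """
    above = {lang: c for lang, c in confidences.items() if c > threshold}
    if not above:
        return None  # keep displaying in the preset (fallback) language
    # max() over the candidates also covers the single-candidate case.
    return max(above, key=above.get)

print(pick_language({"de": 0.81, "fr": 0.12, "it": 0.05}))  # de
print(pick_language({"de": 0.61, "fr": 0.72, "it": 0.55}))  # fr
```

Returning `None` models the case where decoding simply continues in the preset language until the classifier becomes confident.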
  • the on-screen display module is specifically used to decode the speech data to be recognized and display the decoded speech data in a preset language before the language corresponding to the speech data is determined; and, after the language corresponding to the speech data is determined, to decode the speech data using the mixed bilingual model corresponding to the determined language and replace the displayed result with the determined language.
  • the multilingual acoustic model includes a preset language model, and when the on-screen display module performs display in the preset language, a third output result is obtained by inputting the first output result into the hidden layer of the preset language model, where the preset language model is located at the output layer of the shared hidden layer;
  • the third output result is decoded to obtain recognized speech information, which is displayed in the preset language.
  • the high-level hidden layers of the multiple mixed bilingual models independently form mixed output layers, and when the on-screen display module performs the replacement display in the determined language, after the language corresponding to the speech data to be recognized is determined, the displayed speech information is replaced and displayed in the determined language using the mixed bilingual model corresponding to that language.
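The two-stage on-screen behavior described above (stream results in the preset language first, then replace them once the language is determined) might be sketched as follows; the class and method names are hypothetical:

```python
# Hypothetical sketch of the two-stage on-screen display: stream partial
# results in the preset (fallback) language, then replace everything shown
# once the classifier has determined the language.

class ScreenDisplay:
    def __init__(self, preset_lang="en"):
        self.preset_lang = preset_lang
        self.lines = []           # what the user currently sees
        self.determined = None    # language decided by the classifier

    def stream_partial(self, text):
        """Before the language is determined, display in the preset language."""
        if self.determined is None:
            self.lines.append((self.preset_lang, text))

    def replace_with(self, lang, decoded_lines):
        """After determination, redecode with the matching mixed bilingual
        model and replace everything shown so far."""
        self.determined = lang
        self.lines = [(lang, t) for t in decoded_lines]

screen = ScreenDisplay()
screen.stream_partial("navigiere...")   # shown via the fallback first
screen.replace_with("de", ["navigiere zum Alexanderplatz"])
print(screen.lines)  # [('de', 'navigiere zum Alexanderplatz')]
```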
  • the multilingual acoustic model is established based on a neural network, and the device may also include the following modules:
  • the multilingual acoustic model generation module is used to generate a multilingual acoustic model based on the fusion of shared hidden layers of multiple mixed bilingual models.
  • the multilingual acoustic model generation module is specifically used to divide the hidden layers of the multiple mixed bilingual models into bottom hidden layers and high-level hidden layers according to a preset ratio, and merge the bottom hidden layers to generate a shared hidden layer;
  • the multiple mixed bilingual models are neural networks including multiple hidden layers;
  • the high-level hidden layers of the multiple mixed bilingual models have the language characteristics of the respective mixed bilingual models;
  • a hidden layer of a preset language model is added to the output layer of the shared hidden layer;
  • the output layers of the multiple high-level hidden layers with language characteristics are merged as the input layer of the preset language classification model to construct the preset language classification model;
  • the high-level hidden layers of the multiple mixed bilingual models independently form mixed output layers, so that the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layers, the preset language classification model, and the mixed output layers are used to generate the multilingual acoustic model.
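The architecture generated by the steps above might be sketched as follows. This is a NumPy toy with illustrative dimensions only: the application describes LSTM hidden layers, which are stood in for here by simple dense layers, and the 80/20 depth split, branch names, and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(dim_in, dim_out):
    # One dense layer with tanh, standing in for an LSTM hidden layer.
    w = rng.standard_normal((dim_in, dim_out)) * 0.1
    return lambda x: np.tanh(x @ w)

# Bottom hidden layers of all mixed bilingual models merged into one
# shared stack (e.g. ~80% of the depth, per the preset-ratio split).
shared = [layer(40, 64), layer(64, 64), layer(64, 64), layer(64, 64)]
# Retained high-level hidden layers, one branch per mixed bilingual model.
branches = {"en-de": layer(64, 64), "en-fr": layer(64, 64)}
# Hidden layer of the preset-language (e.g. English) model on the shared output.
preset_branch = layer(64, 64)
# Language classifier over the concatenated branch outputs.
clf = layer(64 * len(branches), len(branches))

def forward(frames):
    h = frames
    for l in shared:                                   # first output result
        h = l(h)
    seconds = {k: b(h) for k, b in branches.items()}   # second output results
    concat = np.concatenate([seconds[k] for k in sorted(seconds)], axis=-1)
    logits = clf(concat)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    conf = e / e.sum(axis=-1, keepdims=True)           # per-language confidence
    return conf, seconds, preset_branch(h)

conf, seconds, preset_out = forward(rng.standard_normal((1, 40)))
print(conf.shape, preset_out.shape)  # (1, 2) (1, 64)
```

The point of the structure is that the expensive shared stack runs once per frame, while only the small per-language branches and the thin classifier are duplicated.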
  • the present application also provides a vehicle-mounted terminal, including:
  • the vehicle-mounted terminal includes the above multilingual speech recognition apparatus, a processor, a memory, and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, the processes of the above multilingual speech recognition method are realized, achieving the same technical effects. To avoid repetition, details are not repeated here.
  • the present application also provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the various processes of the above-mentioned multilingual recognition method for speech can be realized, and the same technical effect can be achieved. In order to avoid repetition, details are not repeated here.
  • the examples of the present application may be provided as a method, an apparatus, or a computer program product. Accordingly, the examples of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present examples may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product comprising instruction means, and the instruction means implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded on a computer or other programmable data processing terminal equipment, so that a series of operation steps are executed on the computer or other programmable terminal equipment to generate computer-implemented processing, so that the instructions executed on the computer or other programmable terminal equipment provide steps for realizing the functions specified in one or more procedures of the flow chart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A multi-language recognition method and apparatus for speech, and a terminal and a storage medium. The method comprises: acquiring speech data to be subjected to recognition and a multi-language acoustic model, wherein the multi-language acoustic model is obtained by means of performing fusion on the basis of a shared hidden layer of a plurality of hybrid bilingual models (201); obtaining a confidence level for each language according to said speech data and the multi-language acoustic model (202); and determining, on the basis of the confidence level for each language, a language corresponding to said speech data (203).

Description

Multilingual speech recognition method, apparatus, terminal and storage medium
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office on January 19, 2022, with application number 202210058785.2 and entitled "Multilingual speech recognition method, apparatus, terminal and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a multilingual speech recognition method, a corresponding multilingual speech recognition apparatus, a corresponding vehicle-mounted terminal, and a computer-readable storage medium.
Background Art
With the increasing maturity of artificial-intelligence technologies, more and more smart devices have entered users' lives, and human-machine interaction has become commonplace. Voice input, as a natural and convenient mode of human-computer interaction, frees the user's hands, and most current smart devices have a speech recognition function, which improves convenience for users. At present, the speech data to be recognized may not be speech in a single language only; it may also be mixed bilingual or mixed multilingual speech. Existing mixed multilingual recognition models are mainly constructed by separately building acoustic models for each mixed bilingual pair, such as English-German or English-French, and performing language recognition based on the output scores of the multiple sets of acoustic models. This language recognition approach requires a huge amount of computation, and its language recognition efficiency is low.
Summary of the Invention
A first aspect of the present application provides a multilingual speech recognition method, including:
acquiring speech data to be recognized and a multilingual acoustic model, the multilingual acoustic model being obtained by fusing shared hidden layers of multiple mixed bilingual models;
obtaining confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model; and
determining the language corresponding to the speech data to be recognized based on the confidence levels for the languages.
A second aspect of the present application provides a multilingual recognition apparatus, including:
a multilingual acoustic model acquisition module, configured to acquire speech data to be recognized and a multilingual acoustic model, the multilingual acoustic model being obtained by fusing shared hidden layers of multiple mixed bilingual models;
a confidence generation module, configured to obtain confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model; and
a language recognition module, configured to determine the language corresponding to the speech data to be recognized based on the confidence levels for the languages.
A third aspect of the present application provides a vehicle-mounted terminal, including a processor, a memory, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of any of the above multilingual speech recognition methods.
A fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any of the above multilingual-acoustic-model-based language recognition methods or the steps of any of the above multilingual speech recognition methods.
According to the multilingual speech recognition method provided by the present application, the speech data to be recognized is recognized using a multilingual acoustic model generated by fusing the shared hidden layers of multiple mixed bilingual models, so as to obtain a confidence level for each language; the language corresponding to the speech data is then determined based on the obtained confidence levels, completing the multilingual recognition of the speech. Because multilingual speech is recognized with a model that shares a hidden layer across the mixed bilingual models, the amount of computation of a traditional multilingual recognition model is reduced, the efficiency of language recognition is improved, and the user experience is thereby improved.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present application.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments of the present application with reference to the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 is a schematic diagram of a multilingual acoustic model in the related art;
Fig. 2 is a flow chart of the steps of the multilingual speech recognition method provided by the present application;
Fig. 3 is a schematic diagram of the multilingual acoustic model provided by the present application;
Fig. 4 is a schematic diagram of an application of the multilingual acoustic model provided by the present application;
Fig. 5 is a structural block diagram of the multilingual speech recognition apparatus provided by the present application.
Detailed Description of the Embodiments
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the scope of this application to those skilled in the art.
The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "said" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", "third", and so on may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present application, "plurality" means two or more, unless otherwise specifically defined.
The present application provides a multilingual speech recognition method, a corresponding multilingual speech recognition apparatus, a corresponding vehicle-mounted terminal, and a computer-readable storage medium that overcome, or at least partially solve, the problems described in the background art.
The technical solution of the present application is described in detail below with reference to the accompanying drawings.
The speech data to be recognized may not only be speech in a single language; it may also be mixed bilingual or mixed multilingual speech. For example, in the scenario of promoting a product across a wide area such as the whole world, Asia, or Europe, a region may have many language categories (for example, more than 20 different languages), and the language families of the different languages may differ greatly, making it difficult to achieve unified modeling for language recognition. Moreover, because the countries of such a region occupy small areas and communicate with one another frequently, in addition to supporting the local language, the system also needs to recognize the POIs (Points of Interest) and command words of other major countries and regions. Considering modeling cost and user experience, the mixed multilingual recognition system to be built needs to occupy few resources and recognize quickly.
At present, for the construction of mixed multilingual recognition models, Fig. 1 shows a schematic diagram of a multilingual acoustic model in the related art. For user speech language recognition in a certain region, since acoustic modeling cannot be performed on different kinds of languages simultaneously, assume that the widely used preset language of the region is English. Based on the consideration that English is a widely used language, more than 20 sets of mixed bilingual models are typically built, such as English-German (whose neural network layers, for example N LSTM (Long Short-Term Memory) hidden layers, can output English phoneme feature vectors and German phoneme feature vectors for softmax score calculation through its mixed output layer) and English-French (whose neural network layers, for example M LSTM hidden layers, can output English phoneme feature vectors and French phoneme feature vectors for softmax score calculation through its mixed output layer). The modeling mainly performs mixed bilingual modeling of the place names, person names, and institution names of the countries in the region using the corresponding language families. General command words can also be modeled based on a widely used language such as English, to ensure that instructions can be completed in English when recognition in other languages is inaccurate, providing a fallback. In the process of using the multilingual acoustic model shown in Fig. 1, after the acoustic models of each mixed bilingual pair, such as English-German and English-French, are modeled separately to obtain multiple mixed bilingual models, the language corresponding to the speech data to be recognized can be determined based on the language scores of the speech data in the multiple mixed bilingual models.
However, this language recognition approach based on the output scores of multiple sets of acoustic models involves modeling more than 20 sets of mixed bilingual models such as English-German and English-French, which consumes a large amount of memory and places high requirements on the machines on which the models are deployed. When more than 20 sets of mixed bilingual models are built and multiple sets of mixed bilingual acoustic models are used to compute the language scores of a user's speech request, using this multilingual acoustic model requires a large amount of computation; the acoustic models then need to be shrunk while a more powerful CPU (Central Processing Unit) is used. Shrinking an acoustic model means reducing the feature vector dimensions and the number of neural network layers, and weakening the feature selection performed by each neural network layer degrades the recognition performance of the model. In addition, the score competition among the multiple language scores causes the on-screen result to change frequently, and the time consumed by the score-competition model increases the on-screen display delay. This cannot meet the requirements of low resource occupation and fast recognition, and degrades the user's on-screen experience.
Referring to Fig. 2, a flow chart of the steps of the multilingual speech recognition method provided by the present application is shown, which may specifically include the following steps:
Step 201: acquire the speech data to be recognized and a multilingual acoustic model, where the multilingual acoustic model is obtained by fusing the shared hidden layers of multiple mixed bilingual models.
In the present application, multilingual speech can be recognized by a multilingual acoustic model obtained by fusing the shared hidden layers of multiple mixed bilingual models. Based on the shared hidden layer in the model, the amount of computation of a traditional multilingual recognition model is reduced and the efficiency of language recognition is improved.
Here, multiple sets of mixed bilingual models do not all participate in the model's calculation and recognition process. Before acquiring the multilingual acoustic model obtained by fusing the shared hidden layers of multiple mixed bilingual models, the multilingual acoustic model used in the present application can be constructed.
Specifically, one of the core ideas of the multilingual acoustic model constructed in the present application is to merge the bottom hidden layers of multiple sets of mixed bilingual models into a shared hidden layer, reducing the memory consumption of the constructed multilingual acoustic model. In the process of recognizing the speech data with the mixed bilingual models, a language classification step based on a preset language classification model is added, and the output of each high-level hidden layer is cached before the language is determined, so that when the speech data is subsequently displayed on the screen in the corresponding language, it can be displayed based on the mixed bilingual model of the determined language, reducing the amount of computation of the multilingual acoustic model.
In practical applications, the multilingual acoustic model can be generated using the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layers, the preset language classification model, and the constructed mixed output layers.
Regarding the construction of the shared hidden layer: in existing multilingual acoustic models, each mixed bilingual pair, such as English-German or English-French, is modeled separately, and each mixed bilingual model is a neural network with multiple hidden layers. Among the hidden layers of a mixed bilingual model, for example, N layers may share parameter commonality with the other mixed bilingual models; in these layers, each layer of the neural network can extract feature dimensions common across the language models, such as pauses and syllable lengths that occur universally across languages. The hidden layers of a mixed bilingual model may also include hidden layers carrying the characteristics of its own languages. The hidden layers with common parameters may be called bottom hidden layers, and the hidden layers with distinct language characteristics may be called high-level hidden layers. It should be noted that each mixed bilingual pair can be constructed by mixing a preset language widely used in the region with another language, and is not limited to English-German, English-French, or other specific acoustic models.
Referring to Fig. 3, a schematic diagram of the multilingual acoustic model provided by the present application is shown. To reduce the memory consumption and the amount of computation of the multilingual acoustic model, the hidden layers of the multiple mixed bilingual models, such as English-German and English-French, can be divided according to whether they carry distinct language characteristics, usually according to a preset ratio, for example 80% bottom hidden layers and 20% high-level hidden layers in each mixed bilingual model. The bottom hidden layers are merged into a shared hidden layer, and the high-level hidden layers that clearly carry the characteristics of a specific language family are retained. The introduction of the shared bottom hidden layer improves the hardware suitability of the constructed multilingual acoustic model on the device.
In the constructed multilingual acoustic model, the output of the merged shared hidden layer serves as the input of each retained high-level hidden layer, so that the output of each high-level hidden layer can subsequently be displayed in the corresponding language through the mixed output layer of the determined language. While reducing memory consumption and computation, retaining the high-level hidden layers improves the modeling accuracy of the multilingual acoustic model.
In the constructed multilingual acoustic model, while the output of the merged shared hidden layer is used as the input of the high-level hidden layers, a hidden layer of a preset language model can also be added to the output layer of the shared hidden layer, so that the output of the shared hidden layer also serves as the input of the preset language model. In this way, before the language is determined, the speech data can be displayed on the screen in the preset language, reducing the display delay and avoiding frequent language changes of the displayed result, improving the user's on-screen experience. The preset language may be a language widely used in a certain region; for example, for recognizing user speech in European countries, English is widely used in Europe, so the hidden layer of a preset English model can be added to the output layer of the shared hidden layer, providing on-screen display in English as well as a fallback for the constructed multilingual acoustic model.
The language of the speech can be determined by introducing a preset language classification model. The high-level hidden layers of the multiple mixed bilingual models carry language features specific to their respective models. Concretely, the multiple output layers of these high-level hidden layers, each carrying the language features of its mixed bilingual model, are merged into the input layer of the preset language classification model to construct that model; when determining the language of the speech data, the corresponding language is determined from the classification confidence the preset language classification model assigns to each language.
The language features used to train the language classification model are chiefly high-level abstract features. As shown in Figure 3, these high-level abstract features are derived from the outputs of the bottom and high-level hidden layers of the multiple mixed bilingual models. Because the neural network in the classification model can directly use language features that are already highly abstract, no large feature-extraction hidden layers need to be prepended. Building the language classification model on high-level abstract features therefore reduces the number of network layers while still providing the feature dimensions the model needs: the smaller network keeps the classifier's latency and computation low, while the concatenation of cached high-level abstract features preserves its recognition accuracy.
As for the composition of the mixed output layers in the multilingual acoustic model, each independently constructed mixed output layer can be used to decode the language in the speech data to be recognized and to display it on screen in the corresponding language. In practice, to reduce the computation of the multilingual acoustic model while keeping language identification accurate, the outputs of the high-level hidden layers of the multiple mixed bilingual models are first cached; only after the language of the speech data has been determined by the preset language classification model are the cached outputs fed into the mixed output layer of the corresponding mixed bilingual model for processing. In other words, the mixed output layers independently formed from the high-level hidden layers of the multiple mixed bilingual models perform softmax only after the corresponding language has been determined by the preset language classification model.

Caching the outputs of the high-level hidden layers of the multiple mixed bilingual models ensures that no softmax is computed on those outputs before the language is determined. The softmax computation is a standard machine-learning tool that computes the proportion each value takes within a set of values; these proportions are used to score how well each candidate word matches the speech data and to select the words to display on screen.
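As a minimal illustration of the softmax computation described above, the proportion of each value in a set of scores can be computed as follows (a sketch with invented scores; the subtraction of the maximum is a standard trick for numerical stability and is not taken from the source):

```python
import math

def softmax(scores):
    """Convert raw output-layer scores into proportions that sum to 1.

    Subtracting the maximum score first keeps exp() numerically stable.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate words; the word with the
# largest proportion would be selected for on-screen display.
probs = softmax([2.0, 1.0, 0.1])
```

The proportions always sum to 1, so they can be compared directly against per-language confidence thresholds later in the pipeline.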
As shown in Figure 3, the hidden-layer output immediately before the last softmax layer of each mixed bilingual model (i.e., before the constructed mixed output layer), that is, the output of the high-level hidden layers, can be cached. Without this caching, the mixed output layer of every mixed bilingual model would have to compute a softmax. To keep the computation low, the mixed-output-layer computation of every mixed bilingual model is suspended, so no softmax is computed before the language is determined; once the language is determined, the softmax computation of the mixed output layer is restarted, but now only the mixed output layer of the mixed bilingual model corresponding to the determined language needs to compute a softmax over the speech data to be recognized.

It should be noted that the multiple mixed bilingual models are neural networks with multiple hidden layers and adopt an LSTM structure, so the multilingual acoustic model built from the shared hidden layers and high-level hidden layers of those models also adopts an LSTM structure. In an LSTM, the hidden-layer dimension for each frame of the speech data to be recognized does not grow over time; the hidden-layer dimension is fixed. When the speech data to be recognized has 20 frames and the hidden-layer dimension of the constructed multilingual acoustic model is 512, the memory occupied during language identification and on-screen display can be 20 * 20 * 512 * 4 bytes ≈ 0.78 MB, so the constructed multilingual acoustic model is suitable for cloud storage.
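The memory figure above can be reproduced with a quick back-of-the-envelope calculation. This sketch follows the 20 * 20 * 512 * 4-byte expression in the text; the reading of the two factors of 20 as frames times cached vectors per frame is an assumption made only for illustration:

```python
# Back-of-the-envelope memory estimate for the cached hidden-layer state.
# The interpretation of the two factors of 20 (frames x cached vectors
# per frame) is an assumption; the source gives only the product.
frames = 20
cached_vectors_per_frame = 20
hidden_dim = 512
bytes_per_float = 4

total_bytes = frames * cached_vectors_per_frame * hidden_dim * bytes_per_float
total_mib = total_bytes / (1024 * 1024)  # 819200 bytes, about 0.78 MiB
```

Because the hidden-layer dimension is fixed per frame, this footprint grows only linearly with the number of frames, which is what makes the cached-state approach cheap enough for a cloud deployment.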
Step 202: obtain a confidence for each language based on the speech data to be recognized and the multilingual acoustic model.
After obtaining the multilingual acoustic model, which is a neural network built from the multiple hidden layers of the multiple mixed bilingual models and, specifically, from the fusion of their shared hidden layers, the model can be used to recognize the speech data to be recognized and produce a confidence for each language, so that the language of the speech data can subsequently be determined from these confidences.

Specifically, the hidden layers of the multiple mixed bilingual models are divided, at a preset ratio, into bottom hidden layers and high-level hidden layers, and the bottom hidden layers are merged to form the shared hidden layers used to build the multilingual acoustic model. The speech data to be recognized is first passed through the shared hidden layers of the multiple mixed bilingual models to obtain a first output that carries no pronounced language features; this first output is then fed into the high-level hidden layers of the multiple mixed bilingual models to obtain second outputs that do carry pronounced language features. The multiple second outputs are then used as the input of the preset language classification model to obtain multiple confidences, one per language.
The speech data to be recognized is input into the shared hidden layers of the multiple mixed bilingual models to obtain the first output. The shared hidden layers are those whose parameters are common across the different mixed bilingual models, for example layers that capture features shared by all languages such as pauses and syllable duration; the resulting first output carries no pronounced language features and cannot yet be used to decide the language.

The high-level hidden layers are those that carry pronounced language features in each mixed bilingual model, so the multiple second outputs they produce each carry the language features of a specific language family, and these outputs can be used to decide the language. The outputs of the high-level hidden layers, i.e., the multiple second outputs, can be cached for the time being, ensuring that no softmax is computed on them before the language is determined: the mixed-output-layer computation of every mixed bilingual model is suspended, and no softmax is performed until the language is known. When the softmax computation of the mixed output layer is later restarted, only the mixed output layer of the mixed bilingual model corresponding to the determined language needs to compute a softmax over the speech data to be recognized, which reduces the model's computation.
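The cache-then-softmax flow described above can be sketched as follows. This is a hypothetical illustration: toy scalar-weighted branches stand in for the shared and high-level hidden layers, and no real acoustic model is involved:

```python
import math

class DeferredSoftmaxModel:
    """Toy sketch of the deferred-softmax flow: the shared layers run
    once, every high-level branch runs and its output is cached, and a
    softmax is computed only for the branch of the language decided
    later."""

    def __init__(self, branches):
        # branches: dict mapping language name -> per-branch weight
        # (a stand-in for the language-specific high-level layers).
        self.branches = branches
        self.cache = {}

    def shared_forward(self, frame):
        # Stand-in for the shared hidden layers (no language features).
        return [x * 0.5 for x in frame]

    def branch_forward(self, frame):
        first_output = self.shared_forward(frame)
        # Run every high-level branch, but cache instead of softmax-ing.
        for lang, w in self.branches.items():
            self.cache[lang] = [x * w for x in first_output]

    def finalize(self, language):
        # Softmax only the cached branch of the determined language.
        scores = self.cache[language]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        t = sum(exps)
        return [e / t for e in exps]

model = DeferredSoftmaxModel({"fr": 1.0, "de": 2.0})
model.branch_forward([1.0, 2.0, 3.0])
probs = model.finalize("fr")
```

The point of the structure is that `finalize` runs once for one branch, instead of every branch paying the softmax cost on every frame.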
The cached second outputs are softmax-computed only once the language has been identified. Meanwhile, the hidden layer of the introduced preset-language model, for example an English model, guarantees on-screen display during the period when no softmax is being computed and real-time display would otherwise be impossible. Concretely, before the language result is determined, the softmax of the preset English model is computed and its result is used to display the speech data to be recognized on screen, which shortens the user's wait for on-screen text while the language is being determined and prevents the absence of real-time display, caused by skipping softmax before the language classification result is available, from degrading the experience.

To determine the language of the speech data to be recognized, a preset language classification model can be introduced. The outputs of the high-level hidden layers carry pronounced language features specific to particular language families. Concretely, the multiple output layers of the high-level hidden layers of the multiple mixed bilingual models serve as the input layer for training the preset language classification model; when the language of the speech data is later determined, the corresponding language is chosen from the classification confidence the model assigns to each language.
Specifically, the high-level hidden layers of each mixed bilingual model carry the language features of their respective model, so their outputs are strongly colored by a particular language. The multi-dimensional feature vectors representing the second outputs are concatenated along the corresponding dimension; as shown in Figure 3, the feature vectors of the individual languages, for example the German hidden features and the French hidden features, are concatenated, and the concatenated language features serve as the input layer of the preset language classification model. The constructed preset language classification model may have M conformer convolution layers, which compute the language softmax scores to obtain a confidence for each language. The concatenated language features are high-level abstract features; exploiting the language differentiation between them, multiple confidences for the different languages can be output within a very short time without having to recognize the complete audio of the speech request. The output confidences can be used to decide the language in real time, ensuring both the real-time behavior of the speech recognition system when making mixed-model decisions and the accuracy of language classification.
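The feature concatenation feeding the language classifier can be sketched as follows. The dimensions are hypothetical, and a simple linear scoring head stands in for the conformer layers mentioned above:

```python
import math

def concat_features(branch_features):
    """Concatenate per-language hidden features along the feature
    dimension, e.g. German hidden features + French hidden features."""
    out = []
    for feats in branch_features:
        out.extend(feats)
    return out

def language_confidences(concatenated, weight_rows):
    """Toy stand-in for the classifier head: one linear score per
    language, normalized by softmax into per-language confidences."""
    scores = [sum(w * x for w, x in zip(row, concatenated))
              for row in weight_rows]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    t = sum(exps)
    return [e / t for e in exps]

# Hypothetical 2-dim hidden features for two branches (German, French).
features = concat_features([[0.2, 0.4], [0.9, 0.1]])
conf = language_confidences(features, [[1, 0, 0, 0], [0, 0, 1, 0]])
```

Because the inputs are already abstract per-language features, the classifier head can stay small, which is what keeps its latency and computation low.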
It should be noted that the language features used to train the language classification model are chiefly high-level abstract features. As shown in Figure 3, these features come from the outputs of the bottom and high-level hidden layers of the multiple mixed bilingual models and belong to the computation cache. Because the neural network in the classification model can directly use language features that are already highly abstract, no large feature-extraction hidden layers need to be prepended. The language classification model can therefore be built with fewer network layers while still receiving the feature dimensions it needs: the smaller network keeps latency and computation low, while the concatenation of high-level abstract features preserves a high recognition accuracy.
Step 203: determine the language of the speech data to be recognized based on the confidence for each language.
In practice, the speech data to be recognized can be decoded in real time, and the words of consecutive frames of a preset length obtained by real-time decoding are fed into the language classification model to determine the language of the real-time speech segment. Specifically, following a language classification decision design that puts user experience first, the words of the consecutive preset-length frames obtained by real-time decoding are input into the language classification model; the model's language softmax computes a confidence of those words for each language, and the language of the speech data to be recognized is determined from those confidences.

The confidence for each language represents how likely the speech data to be recognized is to belong to that language. When determining the language of the speech data, the preset language classification model determines the corresponding language from the multiple per-language confidences. Specifically, the real-time language result for the input words is determined by comparing the confidences against a preset value.

For each word in the input consecutive preset-length frames, regardless of whether decoding of the speech data to be recognized has already timed out (i.e., exceeded the word count): in one case, if exactly one confidence is greater than the preset value, that is, some confidence exceeds the confidence threshold of a particular language, the language corresponding to that confidence is determined to be the language of the speech data to be recognized; in another case, if two or more confidences are greater than the preset value, the language with the largest confidence is determined to be the language of the speech data to be recognized; in yet another case, if none of the confidences reaches the preset value, the language with the largest confidence is taken as the language of the speech data to be recognized. The preset value may be a confidence threshold per language, which is not limited in this embodiment of the present invention.
For example, suppose language identification over frames 2 to 5 of the speech data to be recognized is fast and accurate, and decoding the speech data into words has not yet timed out, i.e., no more than 5 words have been produced. In that case, as long as the language classification confidence of each word in the consecutive preset-length frames, for example 5 consecutive frames, for some language (i.e., the maximum score of the 21-dimensional softmax) exceeds the confidence threshold of 0.8, language identification of the speech data to be recognized is finished; otherwise the language of the most recent 5 consecutive frames of the speech data must continue to be judged. If decoding of the speech data to be recognized has already timed out, i.e., more than 5 words have been produced, and the confidence of each word in the consecutive preset-length frames, for example five consecutive frames, has not reached the confidence threshold of any language, i.e., the confidence criterion is not met, the language with the highest confidence over the most recent 5 frames of speech data is taken as the final language classification result.
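The three-branch decision rule described above can be sketched as follows (a hypothetical implementation; the per-language thresholds and confidence values are illustrative, not taken from the source):

```python
def decide_language(confidences, thresholds):
    """Pick a language from per-language confidences.

    - exactly one confidence above its threshold -> that language;
    - two or more above -> the language with the largest confidence;
    - none above -> still the language with the largest confidence.
    """
    above = [lang for lang, c in confidences.items()
             if c > thresholds[lang]]
    if len(above) == 1:
        return above[0]
    # Both the "two or more" and the "none" cases fall back to the
    # language with the maximum confidence.
    return max(confidences, key=confidences.get)

# Illustrative values: only English clears its 0.8 threshold.
conf = {"en": 0.85, "fr": 0.30, "de": 0.25}
thr = {"en": 0.8, "fr": 0.8, "de": 0.8}
lang = decide_language(conf, thr)
```

Collapsing the "two or more" and "none" branches into a single argmax fallback keeps the rule total: a language result is always produced, which matters for the timeout case described above.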
In this application, the speech data to be recognized can also be decoded, and the decoded speech data displayed in real time.

Specifically, before the language of the speech data to be recognized is determined, the speech data is decoded and the decoded speech data is displayed in the preset language; after the language is determined, the mixed bilingual model corresponding to the determined language decodes the speech data to be recognized, and the decoded speech data continues to be displayed, with the preset-language text replaced by text in the determined language.
In a concrete implementation, for the on-screen display before the language is determined: after the first output of step 202 is fed into the high-level hidden layers of the multiple mixed bilingual models, their outputs, i.e., the multiple second outputs, are cached to ensure that no softmax is computed on them before the language is determined; that is, the mixed-output-layer computation of every mixed bilingual model is suspended and no softmax is performed until the language is known. After the first output is obtained, a hidden layer of the preset-language model can be added at the output of the shared hidden layers of the constructed multilingual acoustic model: while the output of the merged shared hidden layers, i.e., the first output, is fed into the high-level hidden layers of the multiple mixed bilingual models, it is also fed into the hidden layer of the preset-language model to obtain a third output, so that before the language of the speech data to be recognized is determined, the third output can be used to display the speech data in the preset language.

For example, as shown in Figure 3, the preset language refers to a language widely used in a given region. For recognizing the speech of users in European countries, English is widely used across Europe; a hidden layer of a preset English model can then be added at the output of the shared hidden layers, providing English on-screen display for the constructed multilingual acoustic model while also serving as a fallback.
Also, to avoid frequent changes in the on-screen results, a language replacement operation can be applied to them once the language has been determined. Specifically, as shown in Figure 3, the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer. After the language of the speech data to be recognized is determined, the mixed bilingual model corresponding to that language is used to replace the displayed speech information with text in the determined language. Concretely, based on the identified language, the cached outputs of the high-level hidden layers of the corresponding mixed bilingual model are passed through that model's mixed output layer, where a softmax is computed to produce the speech information, and the on-screen information previously displayed in English is replaced accordingly, achieving low-cost, low-latency on-screen display and improving the user experience.

In practice, the on-screen results are replaced with the corresponding language according to the softmax computation of the mixed bilingual model that matches the real-time language classification result. The speech data to be recognized may be mixed-language audio, for example English plus a local-language place name (say, English plus French). In that case, once the language has been identified, the corresponding bilingual mixed acoustic model is activated and used to compute the softmax over the previously cached output layer, recognizing each word and replacing the corresponding words in the on-screen result with the words recognized as French.
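The word-level replacement for mixed-language audio can be sketched as follows. This is a hypothetical illustration: a real system would operate on decoder alignments, whereas here the recognized words are simply position-aligned with the on-screen words, and the example words are invented:

```python
def replace_words(on_screen_words, recognized, target_lang):
    """Replace on-screen words at the positions where the activated
    bilingual model recognized the target language.

    recognized: list of (word, language) pairs position-aligned with
    the currently displayed on-screen words.
    """
    out = list(on_screen_words)  # copy; keep the original display intact
    for i, (word, lang) in enumerate(recognized):
        if lang == target_lang:
            out[i] = word
    return out

# Hypothetical English-first display of an English + French request;
# "Shanzelize" stands in for a mis-rendered French place name.
shown = ["navigate", "to", "Shanzelize"]
aligned = [("navigate", "en"), ("to", "en"), ("Champs-Elysees", "fr")]
updated = replace_words(shown, aligned, "fr")
```

Only the positions recognized as the target language change, so the English words already on screen are left untouched and the display does not flicker.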
In this application, a multilingual acoustic model generated by fusing the shared hidden layers of multiple mixed bilingual models recognizes the speech data to be recognized and produces a confidence for each language; the language of the speech data is determined from these confidences, completing the multilingual recognition of the speech. Recognizing the languages of speech with a multilingual acoustic model obtained by fusing the shared hidden layers of multiple mixed bilingual models reduces, through the shared hidden layers, the computation required by traditional multilingual recognition models, improves the efficiency of language identification, and thereby improves the user experience.
Referring to Figure 4, which shows a schematic diagram of an application of the multilingual acoustic model provided by this application, the constructed multilingual acoustic model can be applied to recognizing a user's personalized speech. The corresponding acoustic-language working mechanism can be divided into two stages: decoding and on-screen display.

Specifically, before the constructed multilingual acoustic model has determined the language, the models of the language widely used in the region, for example the English acoustic model and the English language model, decode streaming on-screen results to guarantee the user experience, while the multilingual acoustic model caches the hidden-layer outputs for language determination. After the language is determined, the English branch shown in Figure 3 performs no further computation; instead, the mixed bilingual model matching the language result computes the softmax, i.e., the softmax is computed through the mixed output layer built on the high-level cache layer of that mixed bilingual model, its result replaces the on-screen result, and the language model of the corresponding language is simultaneously invoked for normal decoding.
The user's speech recognition may work as follows: while decoding the speech data to be recognized, the user's IP (Internet Protocol) address is used to retrieve the resource information of the city where that IP address is located, together with the language information determined by the multilingual acoustic model, which improves the recognition rate of the speech data to be recognized. As shown in Figure 4, the resource refers to an additional Ngram model (an algorithm based on statistical language models) trained with place names. The general neural-network language model NNLM (Neural Network Language Model) is trained on text related to the POI (Point of Interest) place names of the entire country corresponding to the identified language, whereas the personalized city-level model is, compared with the general NNLM, trained mainly on the (small amount of) POI data of place-name text for the corresponding city. For reasons of computation and storage, the personalized city-level model is small in size, completing the construction of the personalized language model.

In this application scenario, the constructed multilingual acoustic model can be used to build the language model based on the determined language of the user's speech data to be recognized together with the user's resource information, so that the user's resource information is used comprehensively to recognize the user's language and the accuracy of language recognition is improved.
It should be noted that, for simplicity of description, the methods are expressed as series of combined actions, but those skilled in the art should know that this application is not limited by the described order of actions, because according to this application certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that what is described in the specification is preferred, and the actions involved are not necessarily required by this application.
Referring to Figure 5, which shows a structural block diagram of the multilingual speech recognition apparatus provided by this application, the apparatus may specifically include the following modules:

a multilingual acoustic model acquisition module 501, configured to acquire speech data to be recognized and a multilingual acoustic model, the multilingual acoustic model being obtained by fusing the shared hidden layers of multiple mixed bilingual models;

a confidence generation module 502, configured to obtain a confidence for each language from the speech data to be recognized and the multilingual acoustic model;

a language identification module 503, configured to determine the language of the speech data to be recognized based on the confidence for each language.
In the multilingual speech recognition apparatus, the apparatus may further include the following module:

an on-screen display module, configured to decode the speech data to be recognized and display the decoded speech data in real time.
In the multilingual speech recognition apparatus, the hidden layers of the multiple mixed bilingual models are divided, at a preset ratio, into bottom hidden layers and high-level hidden layers, the bottom hidden layers being merged to generate the shared hidden layers; the confidence generation module 502 may include the following submodules:

a first output generation submodule, configured to input the speech data to be recognized into the shared hidden layers of the multiple mixed bilingual models to obtain a first output;

a second output generation submodule, configured to input the first output into the high-level hidden layers of the multiple mixed bilingual models to obtain multiple second outputs;

a confidence generation submodule, configured to merge the multiple second outputs as the input of a preset language classification model to obtain multiple confidences, one per language.
In the multilingual speech recognition apparatus, the confidence generation submodule is specifically configured to concatenate the multi-dimensional feature vectors representing the second outputs along the corresponding dimension and use the concatenated feature vectors as the input of the preset language classification model to obtain multiple confidences for the different languages.

In the multilingual speech recognition apparatus, the language identification module 503 is specifically configured to: when exactly one confidence is greater than a preset value, determine the language corresponding to that confidence to be the language of the speech data to be recognized; or, when two or more confidences are greater than the preset value, determine the language with the largest confidence to be the language of the speech data to be recognized; or, when none of the confidences reaches the preset value, take the language with the largest confidence as the language of the speech data to be recognized.
在语音的多语种识别装置中,上屏显示模块具体用于在确定所述待识别的语音数据对应的语种之前,解码所述待识别的语音数据并对解码后的语音数据进行预设语种的显示;以及用于在确定所述待识别的语音数据对应的语种之后,采用与所确定语种对应的混合双语模型对所述待识别的语音数据进行解码,并继续对解码后的语音数据进行所确定语种的替换显示。In the voice multilingual recognition device, the upper-screen display module is specifically used for decoding the voice data to be recognized and displaying the decoded voice data in a preset language before determining the language corresponding to the voice data to be recognized; and after determining the language corresponding to the voice data to be recognized, using a mixed bilingual model corresponding to the determined language to decode the voice data to be recognized, and continue to perform replacement display of the determined language on the decoded voice data.
在语音的多语种识别装置中，所述多语种声学模型包括预设语种模型，上屏显示模块进行预设语种的显示时通过将所述第一输出结果输入预设语种模型的隐含层，得到第三输出结果；其中所述预设语种模型位于所述共享隐含层的输出层；In the multilingual speech recognition apparatus, the multilingual acoustic model includes a preset language model. When the on-screen display module displays in the preset language, the first output result is fed into the hidden layer of the preset language model to obtain a third output result, wherein the preset language model is located at the output layer of the shared hidden layer;
在确定所述待识别的语音数据对应的语种之前,解码所述第三输出结果以得到识别的语音信息,并以预设语种进行显示。Before determining the language corresponding to the voice data to be recognized, the third output result is decoded to obtain recognized voice information, and displayed in a preset language.
在语音的多语种识别装置中，所述多个混合双语模型的高层隐含层分别独立构成混合输出层，上屏显示模块进行所确定语种的替换显示时通过在确定所述待识别的语音数据对应的语种后，采用所述待识别的语音数据的语种相应的混合双语模型，对所显示的语音信息以所确定的语种进行替换显示。In the multilingual speech recognition apparatus, the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer. When the on-screen display module performs the replacement display in the determined language, after the language of the speech data to be recognized is determined, the mixed bilingual model corresponding to that language is used to replace the displayed speech information with the result in the determined language.
在语音的多语种识别装置中，所述多语种声学模型基于神经网络建立，所述装置还可以包括如下模块：In the multilingual speech recognition apparatus, the multilingual acoustic model is built on a neural network, and the apparatus may further include the following modules:
多语种声学模型生成模块,用于基于多个混合双语模型的共享隐含层融合生成多语种声学模型。The multilingual acoustic model generation module is used to generate a multilingual acoustic model based on the fusion of shared hidden layers of multiple mixed bilingual models.
其中，多语种声学模型生成模块具体用于将多个混合双语模型的隐含层按照预设比例区分为底层隐含层和高层隐含层，合并所述底层隐含层以生成共享隐含层，其中所述多个混合双语模型为包括多层隐含层的神经网络，所述多个混合双语模型的高层隐含层具有与各个混合双语模型相应的语种特征；在所述共享隐含层的输出层增加预设语种模型的隐含层；将所述具有与各个混合双语模型相应的语种特征的高层隐含层的多个输出层，合并作为预设语种分类模型的输入层构建预设语种分类模型，以及将所述多个混合双语模型的高层隐含层分别独立构成混合输出层，以采用所述共享隐含层、所述预设语种模型的隐含层、所述高层隐含层、所述预设语种分类模型以及所述混合输出层，生成多语种声学模型。Specifically, the multilingual acoustic model generation module is configured to divide the hidden layers of the multiple mixed bilingual models into bottom hidden layers and high-level hidden layers according to a preset ratio, and merge the bottom hidden layers to generate a shared hidden layer, wherein the multiple mixed bilingual models are neural networks comprising multiple hidden layers and the high-level hidden layer of each mixed bilingual model carries the language characteristics of that model; to add the hidden layer of a preset language model at the output layer of the shared hidden layer; to merge the multiple output layers of the high-level hidden layers carrying the language characteristics of the respective mixed bilingual models into the input layer of a preset language classification model so as to construct the preset language classification model; and to have the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer, so that the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layers, the preset language classification model, and the mixed output layers are used to generate the multilingual acoustic model.
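The assembly steps performed by the generation module can be outlined structurally. The sketch below represents layers only by name strings and assumes a 50/50 split ratio, so it illustrates the wiring between components rather than any concrete network:

```python
def split_layers(layers, ratio=0.5):
    """Divide one model's hidden layers into bottom and high-level
    parts according to a preset ratio."""
    cut = int(len(layers) * ratio)
    return layers[:cut], layers[cut:]

def build_multilingual_model(bilingual_models, ratio=0.5):
    """Assemble the multilingual acoustic model skeleton: merge the
    bottom layers into a shared hidden layer, keep each model's
    high-level (language-specific) layers as independent mixed output
    branches, attach a preset language model at the shared layer's
    output, and route all high-level branches into one classifier input."""
    shared_hidden = None
    high_level = {}
    for name, layers in bilingual_models.items():
        bottom, top = split_layers(layers, ratio)
        # "Merging" is represented here by keeping one shared copy.
        if shared_hidden is None:
            shared_hidden = bottom
        high_level[name] = top
    return {
        "shared_hidden": shared_hidden,
        "preset_language_model": ["preset_language_hidden"],
        "high_level": high_level,                 # per-model mixed output layers
        "classifier_inputs": sorted(high_level),  # branches feeding the classifier
    }
```

For example, with two four-layer bilingual models and a ratio of 0.5, the first two layers of each model form the shared hidden layer while the last two remain language-specific.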
本申请还提供了一种车载终端,包括:The present application also provides a vehicle-mounted terminal, including:
包括上述语音的多语种识别装置、处理器、存储器及存储在所述存储器上并能够在所述处理器上运行的计算机程序，该计算机程序被处理器执行时实现上述基于语音的多语种识别方法的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。本申请还提供了一种计算机可读存储介质，计算机可读存储介质上存储计算机程序，计算机程序被处理器执行时实现上述语音的多语种识别方法的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。The vehicle-mounted terminal includes the above multilingual speech recognition apparatus, a processor, a memory, and a computer program stored on the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above multilingual speech recognition method and achieves the same technical effects; to avoid repetition, details are not repeated here. The present application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program likewise implements each process of the above multilingual speech recognition method and achieves the same technical effects; to avoid repetition, details are not repeated here.
本说明书中的各个示例均采用递进的方式描述,每个示例重点说明的都是与其他示例的不同之处,各个示例之间相同相似的部分互相参见即可。Each example in this specification is described in a progressive manner, each example focuses on the difference from other examples, and the same and similar parts of each example can be referred to each other.
本领域内的技术人员应明白，本申请示例的可提供为方法、装置、或计算机程序产品。因此，本申请示例可采用完全硬件、完全软件、或结合软件和硬件方面的形式。而且，本申请示例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。Those skilled in the art should understand that the examples of the present application may be provided as a method, an apparatus, or a computer program product. Accordingly, the examples of the present application may take the form of an entirely hardware example, an entirely software example, or an example combining software and hardware aspects. Furthermore, the examples of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
本申请示例是参照根据本申请示例的方法、终端设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理终端设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。Examples of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to examples of the present application. It should be understood that each procedure and/or block in the flowchart and/or block diagram, and a combination of procedures and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions can be provided to a general-purpose computer, a special-purpose computer, an embedded processor or a processor of other programmable data processing terminal equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal equipment produce an apparatus for realizing the functions specified in one or more procedures of the flow chart and/or one or more blocks of the block diagram.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理终端设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product comprising instruction means, and the instruction means implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
这些计算机程序指令也可装载到计算机或其他可编程数据处理终端设备上,使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程终端设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing terminal equipment, so that a series of operation steps are executed on the computer or other programmable terminal equipment to generate computer-implemented processing, so that the instructions executed on the computer or other programmable terminal equipment provide steps for realizing the functions specified in one or more procedures of the flow chart and/or one or more blocks of the block diagram.
尽管已描述了本申请的优选示例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些示例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选示例以及落入本申请示例范围的所有变更和修改。While preferred examples of the present application have been described, additional alterations and modifications to these examples can be made by those skilled in the art once the basic inventive concepts are appreciated. Therefore, the appended claims are intended to be interpreted to cover the preferred examples and all changes and modifications that fall within the scope of the examples of the application.
最后，还需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语"包括"、"包含"或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下，由语句"包括一个……"限定的要素，并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。Finally, it should also be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprises", "includes", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device comprising said element.
以上已经描述了本申请的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。Having described various embodiments of the present application above, the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and alterations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principle of each embodiment, practical application or improvement of technology in the market, or to enable other ordinary skilled in the art to understand each embodiment disclosed herein.

Claims (12)

  1. 一种语音的多语种识别方法,其特征在于,所述方法包括:A multilingual recognition method of speech, characterized in that the method comprises:
    获取待识别的语音数据和多语种声学模型;所述多语种声学模型基于多个混合双语模型的共享隐含层融合得到;Obtain speech data to be recognized and a multilingual acoustic model; the multilingual acoustic model is obtained based on the fusion of shared hidden layers of multiple mixed bilingual models;
    根据所述待识别的语音数据和所述多语种声学模型,得到针对各语种的置信度;Obtaining confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model;
    基于所述针对各语种的置信度确定所述待识别的语音数据对应的语种。The language corresponding to the voice data to be recognized is determined based on the confidence for each language.
  2. 根据权利要求1所述的方法,其特征在于,还包括:The method according to claim 1, further comprising:
    解码所述待识别的语音数据,对解码后的语音数据进行实时显示。Decoding the speech data to be recognized, and displaying the decoded speech data in real time.
  3. 根据权利要求2所述的方法,其特征在于,所述多个混合双语模型的隐含层包括按照预设比例区分为底层隐含层和高层隐含层,所述底层隐含层用于合并生成共享隐含层;The method according to claim 2, wherein the hidden layers of the plurality of mixed bilingual models include a bottom layer hidden layer and a high layer hidden layer according to a preset ratio, and the bottom layer hidden layer is used for merging to generate a shared hidden layer;
    所述根据所述待识别的语音数据和所述多语种声学模型,得到针对各语种的置信度,包括:According to the speech data to be recognized and the multilingual acoustic model, the confidence for each language is obtained, including:
    将待识别的语音数据输入多个混合双语模型的共享隐含层,得到第一输出结果;Inputting the voice data to be recognized into the shared hidden layers of multiple mixed bilingual models to obtain the first output result;
    将所述第一输出结果分别输入所述多个混合双语模型的高层隐含层,得到多个第二输出结果;Inputting the first output results into the high-level hidden layers of the multiple mixed bilingual models respectively to obtain multiple second output results;
    将所述多个第二输出结果合并作为预设语种分类模型的输入项,得到针对各语种的多个置信度。The multiple second output results are combined as input items of the preset language classification model to obtain multiple confidence levels for each language.
  4. 根据权利要求3所述的方法,其特征在于,所述将所述多个第二输出结果合并作为预设语种分类模型的输入项,得到针对各语种的多个置信度,包括:The method according to claim 3, wherein said combining said plurality of second output results as an input item of a preset language classification model to obtain a plurality of confidence levels for each language includes:
    将用于表征第二输出结果的多维特征向量按照相应维度拼接,并将拼接后的特征向量作为所述预设语种分类模型的输入项,得到针对不同语种的多个置信度。The multi-dimensional feature vectors used to characterize the second output result are spliced according to corresponding dimensions, and the spliced feature vectors are used as input items of the preset language classification model to obtain multiple confidence levels for different languages.
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述基于所述针对各语种的置信度确定所述待识别的语音数据对应的语种,包括:The method according to any one of claims 1 to 4, wherein the determining the language corresponding to the speech data to be recognized based on the confidence for each language includes:
    若有且仅有一个所述置信度大于预设值，则确定该置信度对应的语种为所述待识别语音数据对应的语种；If there is one and only one confidence level greater than a preset value, the language corresponding to that confidence level is determined as the language of the speech data to be recognized;
    或,若存在两个或两个以上置信度大于预设值,则确定所述置信度值最大的对应的语种为待识别语音数据对应的语种;Or, if there are two or more confidence values greater than the preset value, then determine that the corresponding language with the largest confidence value is the language corresponding to the voice data to be recognized;
    或,若所述多个置信度均未达到预设值,则将所述置信度值最大的对应的语种为所述待识别语音数据的语种。Or, if none of the multiple confidence levels reaches the preset value, the corresponding language with the largest confidence level is the language of the voice data to be recognized.
  6. 根据权利要求3所述的方法,其特征在于,所述对解码后的语音数据进行实时显示,包括:The method according to claim 3, wherein said displaying the decoded voice data in real time comprises:
    在确定所述待识别的语音数据对应的语种之前,解码所述待识别的语音数据并对解码后的语音数据进行预设语种的显示;Before determining the language corresponding to the speech data to be recognized, decoding the speech data to be recognized and displaying the decoded speech data in a preset language;
    在确定所述待识别的语音数据对应的语种之后,采用与所确定语种对应的混合双语模型对所述待识别的语音数据进行解码,并继续对解码后的语音数据进行所确定语种的替换显示。After determining the language corresponding to the speech data to be recognized, the speech data to be recognized is decoded by using a mixed bilingual model corresponding to the determined language, and the decoded speech data is continued to be replaced with the determined language.
  7. 根据权利要求6所述的方法,其特征在于,所述多语种声学模型包括预设语种模型,所述在确定所述待识别的语音数据对应的语种之前,解析所述待识别的语音数据并对解析后的语音数据进行预设语种的显示,包括:The method according to claim 6, wherein the multilingual acoustic model includes a preset language model, and before determining the language corresponding to the speech data to be recognized, parsing the speech data to be recognized and displaying the parsed speech data in a preset language includes:
    将所述第一输出结果输入预设语种模型的隐含层,得到第三输出结果;其中所述预设语种模型位于所述共享隐含层的输出层;Inputting the first output result into the hidden layer of the preset language model to obtain a third output result; wherein the preset language model is located at the output layer of the shared hidden layer;
    在确定所述待识别的语音数据对应的语种之前,解码所述第三输出结果以得到识别的语音信息,并以预设语种进行显示。Before determining the language corresponding to the voice data to be recognized, the third output result is decoded to obtain recognized voice information, and displayed in a preset language.
  8. 根据权利要求6所述的方法，其特征在于，所述多个混合双语模型的高层隐含层分别独立构成混合输出层，所述在确定所述待识别的语音数据对应的语种之后，采用与所确定语种对应的混合双语模型对所述待识别的语音数据进行解析，并继续对解析后的语音数据进行所确定语种的显示，包括：The method according to claim 6, wherein the high-level hidden layers of the multiple mixed bilingual models each independently form a mixed output layer, and wherein, after the language corresponding to the speech data to be recognized is determined, parsing the speech data to be recognized with the mixed bilingual model corresponding to the determined language and continuing to display the parsed speech data in the determined language includes:
    在确定所述待识别的语音数据对应的语种后,采用所述待识别的语音数据的语种相应的混合双语模型,对所显示的语音信息以所确定的语种进行替换显示。After the language corresponding to the voice data to be recognized is determined, the displayed voice information is replaced and displayed in the determined language by using a mixed bilingual model corresponding to the language of the voice data to be recognized.
  9. 根据权利要求1所述的方法,其特征在于,所述多语种声学模型基于神经网络建立,还包括:The method according to claim 1, wherein the multilingual acoustic model is established based on a neural network, further comprising:
    将多个混合双语模型的隐含层按照预设比例区分为底层隐含层和高层隐含层，合并所述底层隐含层以生成共享隐含层，其中所述多个混合双语模型为包括多层隐含层的神经网络，所述多个混合双语模型的高层隐含层具有与各个混合双语模型相应的语种特征；The hidden layers of multiple mixed bilingual models are divided into bottom hidden layers and high-level hidden layers according to a preset ratio, and the bottom hidden layers are merged to generate a shared hidden layer, wherein the multiple mixed bilingual models are neural networks comprising multiple hidden layers, and the high-level hidden layers of the multiple mixed bilingual models have language characteristics corresponding to each mixed bilingual model;
    在所述共享隐含层的输出层增加预设语种模型的隐含层;Adding a hidden layer of a preset language model to the output layer of the shared hidden layer;
    将所述具有与各个混合双语模型相应的语种特征的高层隐含层的多个输出层,合并作为预设语种分类模型的输入层构建预设语种分类模型,以及将所述多个混合双语模型的高层隐含层分别独立构成混合输出层;Combining the multiple output layers of the high-level hidden layer with language characteristics corresponding to each mixed bilingual model into the input layer of the preset language classification model to construct the preset language classification model, and independently forming a mixed output layer with the high-level hidden layers of the multiple mixed bilingual models;
    采用所述共享隐含层、所述预设语种模型的隐含层、所述高层隐含层、所述预设语种分类模型以及所述混合输出层,生成多语种声学模型。A multilingual acoustic model is generated by using the shared hidden layer, the hidden layer of the preset language model, the high-level hidden layer, the preset language classification model, and the mixed output layer.
  10. 一种语音的多语种识别装置,其特征在于,所述装置包括:A multilingual recognition device for speech, characterized in that the device includes:
    多语种声学模型获取模块,用于获取待识别的语音数据和多语种声学模型;所述多语种声学模型基于多个混合双语模型的共享隐含层融合得到;The multilingual acoustic model acquisition module is used to obtain the speech data to be recognized and the multilingual acoustic model; the multilingual acoustic model is obtained based on the fusion of shared hidden layers of multiple mixed bilingual models;
    置信度生成模块,用于根据所述待识别的语音数据和所述多语种声学模型,得到针对各语种的置信度;A confidence level generation module, configured to obtain confidence levels for each language according to the speech data to be recognized and the multilingual acoustic model;
    语种识别模块,用于基于所述针对各语种的置信度确定所述待识别的语音数据对应的语种。The language recognition module is configured to determine the language corresponding to the speech data to be recognized based on the confidence for each language.
  11. 一种车载终端,其特征在于,包括:如权利要求10所述语音的多语种识别装置、处理器、存储器及存储在所述存储器上并能够在所述处理器上运行的计算机程序,所述计算机程序被所述处理器执行时实现如权利要求1-9中任一项所述语音的多语种识别方法的步骤。A vehicle-mounted terminal, characterized in that it comprises: a multilingual recognition device for speech as claimed in claim 10, a processor, a memory, and a computer program stored on the memory and capable of running on the processor, when the computer program is executed by the processor, the steps of the multilingual recognition method for speech as described in any one of claims 1-9 are realized.
  12. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质上存储计算机程序,所述计算机程序被处理器执行时实现如权利要求1-9中任一项所述语音的多语种识别方法的步骤。A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the multilingual voice recognition method according to any one of claims 1-9 are realized.
PCT/CN2022/140282 2022-01-19 2022-12-20 Multi-language recognition method and apparatus for speech, and terminal and storage medium WO2023138286A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210058785.2A CN114078468B (en) 2022-01-19 2022-01-19 Voice multi-language recognition method, device, terminal and storage medium
CN202210058785.2 2022-01-19

Publications (1)

Publication Number Publication Date
WO2023138286A1 (en)

Family ID: 80284692

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140282 WO2023138286A1 (en) 2022-01-19 2022-12-20 Multi-language recognition method and apparatus for speech, and terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114078468B (en)
WO (1) WO2023138286A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN116386609A (en) * 2023-04-14 2023-07-04 南通大学 Chinese-English mixed speech recognition method

Citations (7)

Publication number Priority date Publication date Assignee Title
US20120203540A1 (en) * 2011-02-08 2012-08-09 Microsoft Corporation Language segmentation of multilingual texts
CN103400577A (en) * 2013-08-01 2013-11-20 百度在线网络技术(北京)有限公司 Acoustic model building method and device for multi-language voice identification
US20170140759A1 (en) * 2015-11-13 2017-05-18 Microsoft Technology Licensing, Llc Confidence features for automated speech recognition arbitration
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium
CN110895932A (en) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multi-language voice recognition method based on language type and voice content collaborative classification
CN112185348A (en) * 2020-10-19 2021-01-05 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment
CN114078468A (en) * 2022-01-19 2022-02-22 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN108615525B (en) * 2016-12-09 2020-10-09 中国移动通信有限公司研究院 Voice recognition method and device
CN107240395B (en) * 2017-06-16 2020-04-28 百度在线网络技术(北京)有限公司 Acoustic model training method and device, computer equipment and storage medium
CN112489622B (en) * 2019-08-23 2024-03-19 中国科学院声学研究所 Multi-language continuous voice stream voice content recognition method and system
CN110930980B (en) * 2019-12-12 2022-08-05 思必驰科技股份有限公司 Acoustic recognition method and system for Chinese and English mixed voice
CN111753557B (en) * 2020-02-17 2022-12-20 昆明理工大学 Chinese-more unsupervised neural machine translation method fusing EMD minimized bilingual dictionary


Also Published As

Publication number Publication date
CN114078468B (en) 2022-05-13
CN114078468A (en) 2022-02-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22921707; Country of ref document: EP; Kind code of ref document: A1)