CN111916062A - Voice recognition method, device and system

Voice recognition method, device and system

Info

Publication number
CN111916062A
CN111916062A
Authority
CN
China
Prior art keywords
language
recognized
model
voice
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910376604.9A
Other languages
Chinese (zh)
Inventor
张仕良
刘媛
雷鸣
李威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910376604.9A
Publication of CN111916062A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The application discloses a voice recognition method, device and system. The method comprises: acquiring a speech to be recognized, wherein the speech to be recognized is speech data containing at least one language; and recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages. The method and the device solve the technical problem in the related art that only speech of a specific language can be recognized and mixed-language speech cannot.

Description

Voice recognition method, device and system
Technical Field
The present application relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, and system.
Background
With the rapid development of the internet and the widespread adoption of intelligent mobile terminals, speech recognition technology is widely used in people's work, life and study, for example in voice dialogue robots, voice assistants and related interactive tools. These devices generally recognize a user's speech to obtain the user's spoken command, and then perform the action corresponding to the recognized command.
However, different countries use different languages, and different regions of the same country may use different dialects. The prior art needs to train a dedicated recognition system for each language from collected data; such a system usually includes a dedicated acoustic model, language model, decoder and pronunciation dictionary, as in the schematic diagram of recognizing speech of a specific language shown in fig. 1. The input of the acoustic model is acoustic features, and the acoustic features are passed through a neural network to obtain the prediction probabilities of the acoustic modeling units; that is, the output of the acoustic model is the prediction probability of each acoustic modeling unit, as shown in fig. 2. The language model is an n-gram language model or a neural network language model trained on text data. The decoder combines the acoustic model, the language model and the pronunciation dictionary to obtain the final recognition result. Such a recognition system can only recognize speech of a specific language; for example, a Chinese speech recognition system can only recognize Chinese, and an English recognition system can only recognize English.
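By way of illustration only (this sketch is not part of the patent), a conventional decoder of the kind described above ranks candidate word sequences W for audio X by combining the acoustic score P(X|W) with the language-model score P(W), typically in the log domain with a language-model weight. All names and numbers below are invented for illustration:

    # Hypothetical sketch: combined scoring in a conventional single-language decoder.
    # Hypotheses are ranked by am_log_prob + lm_weight * lm_log_prob (+ word penalty).
    def decode_score(am_log_prob, lm_log_prob, lm_weight=10.0,
                     word_penalty=0.0, num_words=1):
        return am_log_prob + lm_weight * lm_log_prob + word_penalty * num_words

    # Two invented candidate transcriptions of the same audio:
    candidates = {
        "ni hao shi jie": decode_score(-120.5, -8.2, num_words=4),
        "ni hao shi jia": decode_score(-119.8, -12.9, num_words=4),
    }
    best = max(candidates, key=candidates.get)  # the language-model score outweighs
                                                # the slightly worse acoustic score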
In a practical application scenario, for example when a user purchases a subway ticket through a subway ticket vending machine, the user may not speak Mandarin but only a dialect or another language; if the ticket vending machine can only recognize one language, such a user cannot purchase a ticket normally.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the present application provide a voice recognition method, device and system, to at least solve the technical problem in the related art that only speech of a specific language can be recognized and mixed-language speech cannot be recognized.
According to an aspect of the embodiments of the present application, there is provided a speech recognition method, including: acquiring a speech to be recognized, wherein the speech to be recognized is speech data containing at least one language; and recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.

According to another aspect of the embodiments of the present application, there is also provided a speech recognition method, including: inputting a speech to be recognized, wherein the speech to be recognized is speech data containing at least one language; and outputting feedback information corresponding to a recognition result of the speech to be recognized, wherein the recognition result is obtained by recognizing the speech to be recognized with a recognition model, and the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.

According to another aspect of the embodiments of the present application, there is also provided a speech recognition system, including: an input unit for acquiring a speech to be recognized, wherein the speech to be recognized is speech data containing at least one language; a recognition unit for recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages; and an output unit for outputting feedback information corresponding to the recognition result.

According to another aspect of the embodiments of the present application, there is also provided a speech recognition apparatus, including: an acquisition module for acquiring a speech to be recognized, wherein the speech to be recognized is speech data containing at least one language; and a recognition module for recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein when the program is executed, an apparatus in which the storage medium is located is controlled to execute the voice recognition method.
According to another aspect of the embodiments of the present application, there is also provided a processor for executing a program, wherein the program executes to perform the speech recognition method.
In the embodiments of the present application, a recognition model is adopted to recognize speech containing multiple languages: after speech data containing at least one language is acquired, the speech to be recognized is recognized by the recognition model to obtain a recognition result. The recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages. It is easy to note that, because the mixed acoustic model, the mixed language model and the mixed dictionary in the recognition model cover multiple languages, the recognition model can recognize the speech to be recognized when it contains only one language; and when the speech to be recognized contains multiple languages, the recognition model can also recognize such speech data, thereby achieving the aim of recognizing the speech to be recognized and the technical effect of recognizing mixed-language speech. This solves the technical problem in the related art that only speech of a specific language can be recognized and mixed speech cannot.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a diagram illustrating a method for recognizing speech of a particular language according to the prior art;
FIG. 2 is a schematic diagram of an acoustic model according to the prior art;
FIG. 3 is a schematic diagram of an alternative computer terminal according to an embodiment of the present application;
FIG. 4 is a flow chart of a method of speech recognition according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative identification system according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an alternative identification system according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an alternative hybrid acoustic model according to an embodiment of the present application;
FIG. 8 is a diagram illustrating the processing of words by an alternative hybrid language model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an alternative hybrid language model according to embodiments of the present application;
FIG. 10 is a flow chart of a method of speech recognition according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a speech recognition system according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 13 is a block diagram of a computer terminal according to an embodiment of the present application;
FIG. 14 is a schematic diagram of an alternative acoustic model according to an embodiment of the present application; and
FIG. 15 is a schematic diagram of an alternative initial and final sequence according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
There is also provided, in accordance with an embodiment of the present application, a speech recognition method embodiment. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system capable of executing a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one here.
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 3 shows a hardware block diagram of a computer terminal (or mobile device) for implementing the voice recognition method. As shown in fig. 3, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, …, 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 3 is only an illustration and does not limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 3, or have a different configuration than shown in fig. 3.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuitry acts as a kind of processor control (e.g., selection of a variable-resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the speech recognition method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing, i.e., implementing the method for recognizing a language, by running the software programs and modules stored in the memory 104. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that in some alternative embodiments, the computer device (or mobile device) shown in fig. 3 above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both. It should be noted that fig. 3 is only one particular example, and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
Under the above operating environment, the present application provides a speech recognition method as shown in fig. 4. Fig. 4 is a flowchart of a speech recognition method according to a first embodiment of the present application, and as can be seen from fig. 4, the method provided by the present application includes the following steps:
step S402, acquiring a speech to be recognized, wherein the speech to be recognized is speech data including at least one language.
It should be noted that the speech to be recognized may contain one language, for example Chinese speech, or may contain multiple languages, for example a Chinese sentence with an embedded English word (such as the "Shopping" example below), which contains both Chinese and English. Alternatively, in the case where the speech to be recognized contains only one language, the speech to be recognized may also include speech data of different dialects of the same language, for example Chinese speech containing both Cantonese and Northeastern Mandarin.
In an alternative embodiment, the solution provided by the present application can be applied to an intelligent interactive device, such as a subway ticket vending machine, a voice dialogue robot, or a voice assistant, where the intelligent interactive device has a voice collecting device, which may be, but is not limited to, a microphone. Taking the subway ticket vending machine as an example, when a user purchases a subway ticket, the speech to be recognized, for example "buy a ticket to ZZ railway station", is input through the voice collecting device of the ticket vending machine; the voice collecting device collects the user's speech, and the ticket vending machine thereby obtains the speech to be recognized. Optionally, after receiving the speech to be recognized, the intelligent interaction device may further preprocess it, for example by noise reduction, so that the speech to be recognized can be recognized accurately.
Step S404, recognizing the speech to be recognized based on the recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.
It should be noted that the acoustic model (AM) is used to determine, for a given text sequence, the probability that the text sequence produces the speech to be recognized, and the language model (LM) is used to predict the probability of generating a sequence of characters and/or words, where the sequence corresponding to the speech to be recognized includes, but is not limited to, characters and words. The mixed dictionary is a pronunciation dictionary that includes mappings from words to phonemes; it determines the mapping relationship between the modeling units of the acoustic model and the modeling units of the language model, thereby connecting the acoustic model and the language model into a searchable state space in which the decoder performs its decoding work. Optionally, fig. 5 shows a schematic diagram of a recognition system provided in the present application; the recognition system shown in fig. 5 includes the above recognition model and a decoder, and the decoder outputs the sequence corresponding to the speech to be recognized through the acoustic model, the language model and the dictionary.
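As an illustrative sketch only (the entries below are invented, not taken from the patent), a mixed pronunciation dictionary of this kind maps words of each language to their modeling-unit sequences, linking the acoustic model's units to the language model's words:

    # Hypothetical mixed dictionary: Chinese words map to toned initial/final
    # sequences, English words map to phoneme sequences.
    mixed_dictionary = {
        "科学": ["k", "e1", "x", "ue2"],          # cf. the "ke1 xue2" example below
        "你好": ["n", "i3", "h", "ao3"],
        "shopping": ["SH", "AA", "P", "IH", "NG"],
        "ticket": ["T", "IH", "K", "AH", "T"],
    }

    def pronunciation(word):
        # Look up the modeling-unit sequence for a word of either language.
        return mixed_dictionary.get(word)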
Optionally, after the user's speech to be recognized is acquired, the intelligent interaction device inputs it to the recognition model, which recognizes it. Since the recognition model includes the hybrid acoustic model, the hybrid language model and the hybrid dictionary, when processing speech containing multiple languages, the acoustic features of the speech to be recognized may be input into the acoustic model corresponding to each language in the hybrid acoustic model and then decoded in the decoder according to the hybrid modeling units of the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain the recognition result; that is, the recognition model in the present application can process mixed-language speech.
It should be noted that the above recognition result is used to characterize the result of recognizing the speech to be recognized as text. For example, if the user inputs "I want to go Shopping, recommend a location for Shopping" to the intelligent interactive device, the intelligent interactive device can recognize the user's speech as that text and give corresponding feedback, for example outputting the location for Shopping.
Based on the scheme defined in the above steps S402 to S404, in which a recognition model is used to recognize speech containing multiple languages, after speech data containing at least one language is acquired, the speech to be recognized is recognized by the recognition model to obtain a recognition result. The recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.

It is easy to note that, because the mixed acoustic model, the mixed language model and the mixed dictionary in the recognition model cover multiple languages, the recognition model can recognize the speech to be recognized when it contains only one language; and when the speech to be recognized contains multiple languages, the recognition model can also recognize such speech data, thereby achieving the aim of recognizing the speech to be recognized and the technical effect of recognizing mixed-language speech. This solves the technical problem in the related art that only speech of a specific language can be recognized and mixed speech cannot.
In an optional embodiment, fig. 5 shows a schematic diagram of the recognition system provided by the present application. As can be seen from fig. 5, after the intelligent interaction device obtains speech in multiple languages (i.e., the speech to be recognized), the speech to be recognized is input to the recognition model, which processes it to obtain its acoustic features; the acoustic features are then processed in the decoder based on the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain the target sentence corresponding to the speech to be recognized. The acoustic features of the speech to be recognized include, but are not limited to, Mel-frequency cepstral coefficients (MFCC), filter-bank features (FBK), and perceptual linear prediction coefficients (PLP).
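As a minimal sketch of the feature-extraction step, assuming the open-source librosa library (the patent does not name any library, and the 16 kHz rate and frame sizes are illustrative):

    import librosa

    def extract_mfcc(wav_path):
        y, sr = librosa.load(wav_path, sr=16000)      # load and resample
        mfcc = librosa.feature.mfcc(
            y=y, sr=sr, n_mfcc=13,
            n_fft=int(0.025 * sr),                    # 25 ms analysis window
            hop_length=int(0.010 * sr),               # 10 ms frame shift
        )
        return mfcc.T                                 # shape: (num_frames, 13)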
It should be noted that the process by which the recognition model processes the speech to be recognized to obtain its acoustic features is a preprocessing of the speech to be recognized. Alternatively, fig. 6 shows another form of recognition system, which also includes a recognition model and a decoder; as can be seen from fig. 6, the recognition model includes a front-end processing unit and a back-end processing unit (not including the decoder), where the front-end processing unit is used to preprocess the speech to be recognized. Specifically, the recognition model performs endpoint detection on the speech to be recognized to obtain a first speech, then performs noise reduction on the first speech to obtain a second speech, and performs feature extraction on the second speech to obtain the acoustic features of the speech to be recognized. In fig. 6, the speech to be recognized is present in the form of a speech signal.
Optionally, in the process of preprocessing the speech to be recognized, windowing and framing operations may be performed on the speech to be recognized, where a Hamming window may be used for windowing, and the window shift of the Hamming window may be 10 ms. After windowing the speech to be recognized, endpoint detection, noise reduction and feature extraction are performed on the windowed speech.
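The framing-and-windowing operation just described can be sketched as follows (a minimal illustration, assuming a 25 ms frame length to go with the 10 ms shift; the patent itself specifies only the Hamming window and the 10 ms shift):

    import numpy as np

    def frame_and_window(signal, sample_rate=16000, frame_ms=25.0, shift_ms=10.0):
        frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
        shift = int(sample_rate * shift_ms / 1000)       # e.g. 160 samples (10 ms)
        window = np.hamming(frame_len)
        # assumes len(signal) >= frame_len
        num_frames = 1 + (len(signal) - frame_len) // shift
        frames = np.stack([
            signal[i * shift: i * shift + frame_len] * window
            for i in range(num_frames)
        ])
        return frames                                    # (num_frames, frame_len)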
Further, after the speech to be recognized is processed to obtain its acoustic features, the intelligent interaction device processes the acoustic features in the decoder based on the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain the target sentence corresponding to the speech to be recognized. Specifically, the acoustic features are processed based on the hybrid acoustic model to obtain the modeling units corresponding to the acoustic features; the result of the modeling units' processing of the acoustic features is then obtained; the words corresponding to that result are determined in the decoder based on the hybrid dictionary; and finally the words are processed based on the hybrid language model to obtain the recognition result.
Optionally, the recognition model inputs the acoustic features into the acoustic model corresponding to each language in the hybrid acoustic model, then obtains the probability with which the acoustic model of each language outputs its modeling units, and determines the modeling units corresponding to the acoustic features according to these probabilities. For example, suppose the speech to be recognized contains two languages, Chinese and English. After the acoustic features are obtained, they are input into the acoustic model corresponding to each language in the hybrid acoustic model; if the hybrid acoustic model contains 10 acoustic models, the probability with which each acoustic model outputs its modeling units is calculated, the acoustic models are sorted by probability, and the modeling units of the preset number of acoustic models with the largest probability are selected to form the modeling units corresponding to the acoustic features. The preset number may equal the number of languages contained in the speech to be recognized; for example, the preset number is 2 when the speech to be recognized contains two languages.
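A minimal sketch of this selection step (the model interface and scoring method are assumptions for illustration, not the patent's API):

    # Hypothetical: each per-language acoustic model scores the features, and the
    # modeling units of the highest-probability models are kept (as many models
    # as there are languages in the speech to be recognized).
    def select_modeling_units(acoustic_models, features, num_languages):
        # acoustic_models: dict mapping language name -> model with a .score() method
        ranked = sorted(acoustic_models.items(),
                        key=lambda item: item[1].score(features),
                        reverse=True)
        return [language for language, _ in ranked[:num_languages]]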
In an alternative embodiment, the intelligent interactive device first needs to train the hybrid acoustic model before processing the speech to be recognized with it. Specifically, voice data containing multiple languages is first acquired, and acoustic features are extracted from the voice data; the acoustic features are input into the acoustic model corresponding to each language, and finally the hybrid acoustic model is trained based on the acoustic model corresponding to each language and the voice data.
Take the schematic diagram of the hybrid acoustic model shown in fig. 7 as an example, where the speech to be recognized contains two languages, Chinese and English, and the hybrid acoustic model contains at least a Chinese acoustic model and an English acoustic model. First, each language-specific acoustic model is optimized with data of that language; for example, in fig. 7, the Chinese acoustic model and the English acoustic model are each optimized separately. After the acoustic features are obtained, the acoustic features corresponding to the two languages are input into the Chinese acoustic model and the English acoustic model, and the neural network hidden layers of each language's acoustic model process the corresponding acoustic features to obtain the corresponding modeling units, as shown in fig. 7, finally yielding the Chinese modeling units and the English modeling units. The Chinese modeling units are characters, and the English modeling units are sub-word units of English words, i.e., wordpieces.
In the foregoing process, the acoustic model of each language in the hybrid acoustic model may be, but is not limited to, a GMM-HMM acoustic model, where the GMM models the distribution of the acoustic features of the speech to be recognized, and the HMM models its timing information. Fig. 14 is a schematic diagram of an alternative acoustic model, in this case a Chinese acoustic model. After the acoustic features of the speech to be recognized are obtained, the acoustic model first obtains the distribution of the acoustic features through the GMM; then, according to this distribution, the HMM processes the acoustic features to obtain the initial-and-final sequence corresponding to the acoustic features, such as the sequence in fig. 15, from which the word sequence is obtained, as shown in fig. 15. In the initial-and-final sequence of fig. 15, the numbers indicate the tones of the finals; for example, "ke1" indicates that "ke" carries the first tone, and "xue2" indicates that "xue" carries the second tone.
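A minimal per-unit GMM-HMM sketch, assuming the open-source hmmlearn library (the patent names no library, and the state/mixture counts and training data below are invented):

    import numpy as np
    from hmmlearn.hmm import GMMHMM

    # One GMM-HMM per modeling unit (e.g. the toned final "e1"): the GMM models
    # the distribution of the acoustic features, the HMM their time sequence.
    model = GMMHMM(n_components=3, n_mix=4, covariance_type="diag", n_iter=20)

    features = np.random.randn(500, 13)    # stand-in for MFCC frames of one unit
    model.fit(features)

    log_likelihood = model.score(features) # acoustic score used during decoding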
In addition, it should be noted that in the process of acoustic modeling, different languages correspond to different modeling units; for example, the modeling units for Chinese recognition include, but are not limited to, initials and finals, and the modeling units for English recognition include, but are not limited to, phonemes, syllables, and triphones.
In an alternative embodiment, the recognition model further processes the words based on the mixed language model to obtain the recognition result. Specifically, the words are first processed with the mixed language model to obtain multiple candidate sentences corresponding to the speech to be recognized, and then the target sentence is determined from these sentences by an optimal-path search, giving the recognition result. For example, as shown in fig. 8, when the mixed language model processes the word string "nixianzaigenshenme", the corresponding conversion result is as shown in fig. 8: the nodes form a complex network structure, and any path from the beginning to the end yields a conversion result. The mixed language model can select the most suitable result among the multiple paths, in combination with the context, as the target sentence; in fig. 8, the sentence corresponding to the first path (i.e., the path marked with bold arrows) is selected as the target sentence.
Optionally, before the intelligent interaction device processes the speech to be recognized based on the mixed language model, the mixed language model needs to be trained first. Specifically, the text data corresponding to each language is obtained, the language models corresponding to the languages are trained respectively based on the text data corresponding to each language, and finally the language models corresponding to each language are interpolated to obtain the mixed language model.
Take the schematic diagram of the hybrid language model shown in fig. 9 as an example, where the speech to be recognized contains two languages, Chinese and English, and the hybrid language model contains at least a Chinese language model and an English language model. First, each language-specific language model is trained with text of that language; for example, as shown in fig. 9, the Chinese language model is trained with Chinese and the English language model is trained with English. After the language model corresponding to each language is obtained, the language models are interpolated to obtain the hybrid language model.
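The interpolation step can be sketched as follows (a minimal illustration; the interpolation weight and the model interface are assumptions, not values from the patent):

    # Hypothetical: the mixed language model's probability is a weighted sum of
    # the per-language models' probabilities for the same word and history.
    def mixed_lm_prob(word, history, zh_lm, en_lm, zh_weight=0.5):
        # zh_lm / en_lm expose prob(word, history) -> P(word | history)
        return (zh_weight * zh_lm.prob(word, history)
                + (1.0 - zh_weight) * en_lm.prob(word, history))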
It should be noted that the training of the hybrid language model and the hybrid acoustic model is relatively independent; after each model is trained, the decoder combines the hybrid language model and the hybrid acoustic model to obtain the output sequence that best matches the speech to be recognized. The decoding algorithm of the decoder may be, but is not limited to, the Viterbi algorithm.
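For reference, the Viterbi algorithm named above is the standard dynamic-programming search; the generic sketch below is the textbook algorithm, not code from the patent:

    import numpy as np

    def viterbi(log_init, log_trans, log_emit):
        # log_init: (S,), log_trans: (S, S), log_emit: (T, S), all log-probabilities.
        T, S = log_emit.shape
        delta = log_init + log_emit[0]            # best score ending in each state
        backptr = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans   # scores[i, j]: prev i -> state j
            backptr[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emit[t]
        path = [int(delta.argmax())]              # backtrace from the best end state
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return path[::-1]                         # most likely state sequence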
Further, after the recognition result is obtained, the intelligent interaction device can also use the recognition result to generate corresponding feedback information. The recognition result characterizes the result of recognizing the speech to be recognized as text. For example, the user says "I want to go Shopping, recommend a nearby address where I can go Shopping" to the intelligent interaction device; the device recognizes the user's speech to obtain the recognition result, extracts key information (for example, keywords) from it, generates an instruction according to the keywords, acquires the data corresponding to the instruction from the internet or other channels, and feeds the data back to the user, for example by displaying nearby shopping addresses on its display screen or broadcasting them in voice form. Optionally, the feedback information includes at least one of: voice information, text information, picture information, and video information.
In addition, it should be noted that in the prior art, when recognizing speech of multiple languages with unknown language prior information, dedicated speech recognition systems trained for the different languages are each used to recognize the speech, and the recognition result with the highest score is taken as the final result. However, this approach requires decoders for multiple languages, which consumes a large amount of computing resources, and it cannot process mixed (code-switched) speech, for example speech whose first half is Chinese and whose second half is English. If the language prior information is known, an acoustic model with multiple output layers can be constructed, each output layer corresponding to one language; the language of the speech to be recognized is predicted while training the model, and the corresponding recognition system is then used for decoding according to the predicted language. In this approach the prediction accuracy of the language information is low, and if the language is predicted incorrectly, the recognition result will be incorrect, reducing the recognition rate of the speech recognition system.
The scheme provided by the application makes good use of the gains brought by initializing from trained single-language acoustic models; the whole system is further optimized through mixed data and a mixed modeling unit, which preserves the recognition efficiency and accuracy of the system on a specific language while also handling multi-language mixed (code-switched) speech. Compared with the prior art, the scheme provided by the application requires no language prior information during recognition while still guaranteeing the recognition performance of the speech recognition system.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method for recognizing language according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
Example 2
According to an embodiment of the present application, there is also provided a speech recognition method, as shown in fig. 10, the method including the steps of:
step S1002, a speech to be recognized is input, where the speech to be recognized is speech data including at least one language.
Optionally, the user may input the speech to be recognized to an intelligent interaction device, where the intelligent interaction device is a device capable of voice interaction, for example a subway ticket vending machine, a voice dialogue robot, or a voice assistant. The intelligent interaction device has a voice capture device, which may be, but is not limited to, a microphone.
It should be noted that the speech to be recognized may contain one language, for example Chinese speech, or may contain multiple languages. Alternatively, in the case where the speech to be recognized contains only one language, the speech to be recognized may also include speech data of different dialects of the same language, for example Chinese speech containing both Cantonese and Northeastern Mandarin.
Step S1004, outputting feedback information corresponding to a recognition result of the speech to be recognized, where the recognition result is obtained by recognizing the speech to be recognized with the recognition model, and the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.
It should be noted that the intelligent interaction device has an output unit, where the output unit includes, but is not limited to, a voice output unit and a display unit; the voice output unit includes, but is not limited to, a speaker, and the display unit includes, but is not limited to, a display screen, LEDs, and the like. The feedback information includes at least one of: voice information, text information, picture information, and video information. For example, if the user inputs the speech "I want to go Shopping, recommend a place for Shopping", the intelligent interaction device broadcasts the recommended shopping place through the voice output unit, and displays relevant information about the place (e.g., location, route) through the display unit.
In addition, it should be further noted that the acoustic model (AM) is used to determine, for a given text sequence, the probability that the text sequence produces the speech to be recognized; the language model (LM) is used to predict the probability of generating a sequence of characters and/or words; and the decoder outputs the sequence corresponding to the speech to be recognized through the acoustic model, the language model and the dictionary, where that sequence includes, but is not limited to, characters and words. The mixed dictionary is a pronunciation dictionary that includes mappings from words to phonemes; it determines the mapping relationship between the modeling units of the acoustic model and the modeling units of the language model, thereby connecting the acoustic model and the language model into a searchable state space in which the decoder performs its decoding work. Optionally, fig. 5 shows a schematic diagram of a recognition system provided in the present application; the recognition system shown in fig. 5 includes the above recognition model and a decoder.
Optionally, after the user's speech to be recognized is acquired, the intelligent interaction device inputs it to the recognition model, which recognizes it. Since the recognition model includes the hybrid acoustic model, the hybrid language model and the hybrid dictionary, when processing speech containing multiple languages, the acoustic features of the speech to be recognized may be input into the acoustic model corresponding to each language in the hybrid acoustic model and then decoded in the decoder according to the hybrid modeling units of the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain the recognition result; that is, the recognition model in the present application can process mixed-language speech.
Based on the scheme defined in steps S1002 to S1004, in which a recognition model is used to recognize speech containing multiple languages, after speech data containing at least one language is acquired, the speech to be recognized is recognized by the recognition model to obtain a recognition result. The recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.

It is easy to note that, because the mixed acoustic model, the mixed language model and the mixed dictionary in the recognition model cover multiple languages, the recognition model can recognize the speech to be recognized when it contains only one language; and when the speech to be recognized contains multiple languages, the recognition model can also recognize such speech data, thereby achieving the aim of recognizing the speech to be recognized and the technical effect of recognizing mixed-language speech. This solves the technical problem in the related art that only speech of a specific language can be recognized and mixed speech cannot.
It should be noted that the intelligent interaction device in this embodiment may further execute the speech recognition method provided in embodiment 1, and related contents are already described in embodiment 1 and are not described herein again.
Example 3
According to an embodiment of the present application, there is also provided a speech recognition system for implementing the speech recognition method, as shown in fig. 11, the system including: input section 1101, recognition section 1103, and output section 1105.
The input unit 1101 is configured to acquire a speech to be recognized, where the speech to be recognized is speech data containing at least one language; the recognition unit 1103 is configured to recognize the speech to be recognized based on a recognition model to obtain a recognition result, where the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages; and the output unit 1105 is configured to output feedback information corresponding to the recognition result.
Optionally, the input unit includes, but is not limited to, a voice capture device, which may be, but is not limited to, a microphone. The output unit includes, but is not limited to, a voice output unit and a display unit; the voice output unit includes, but is not limited to, a speaker, and the display unit includes, but is not limited to, a display screen, LEDs, and the like. The feedback information includes at least one of: voice information, text information, picture information, and video information. For example, if the user inputs the speech "I want to go Shopping, recommend a place for Shopping", the intelligent interaction device broadcasts the recommended shopping place through the voice output unit, and displays relevant information about the place (e.g., location, route) through the display unit.
In an optional embodiment, after the user's speech to be recognized is acquired, the intelligent interaction device inputs it to the recognition model, which recognizes it. Since the recognition model includes the hybrid acoustic model, the hybrid language model and the hybrid dictionary, when processing speech containing multiple languages, the acoustic features of the speech to be recognized may be input into the acoustic model corresponding to each language in the hybrid acoustic model and then decoded in the decoder according to the hybrid modeling units of the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain the recognition result; that is, the recognition model in the present application can process mixed-language speech.
Therefore, after the speech data containing at least one language is obtained, the recognition model is used to recognize the speech to be recognized, so as to obtain a recognition result. The recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.

It is easy to note that, because the mixed acoustic model, the mixed language model and the mixed dictionary in the recognition model cover multiple languages, the recognition model can recognize the speech to be recognized when it contains only one language; and when the speech to be recognized contains multiple languages, the recognition model can also recognize such speech data, thereby achieving the aim of recognizing the speech to be recognized and the technical effect of recognizing mixed-language speech. This solves the technical problem in the related art that only speech of a specific language can be recognized and mixed speech cannot.
Optionally, the speech recognition system further includes a preprocessing unit, where the preprocessing unit performs endpoint detection on the speech to be recognized to obtain a first speech, performs noise reduction on the first speech to obtain a second speech, and performs feature extraction on the second speech to obtain an acoustic feature of the speech to be recognized, where the recognition unit recognizes the speech to be recognized according to the acoustic feature.
It should be noted that, the acoustic features can be obtained through the preprocessing unit, and in the process of preprocessing the speech to be recognized, the accuracy of recognizing the speech to be recognized can be improved by performing endpoint detection and noise reduction on the speech to be recognized, and the recognition efficiency can be improved.
Further, the speech recognition system further comprises a decoder. After the acoustic features of the voice to be recognized are obtained, the acoustic features are processed in a decoder based on the mixed acoustic model, the mixed language model and the mixed dictionary to obtain a target sentence corresponding to the voice to be recognized, and a recognition result is obtained.
It should be noted that the speech recognition system in this embodiment can also execute the speech recognition method provided in embodiment 1, and related contents have been described in embodiment 1 and are not described herein again.
Example 4
According to an embodiment of the present application, there is also provided an apparatus for implementing the above-mentioned speech recognition method, as shown in fig. 12, the apparatus 120 includes: an acquisition module 1201 and a recognition module 1203.
The acquiring module 1201 is configured to acquire a speech to be recognized, where the speech to be recognized is speech data containing at least one language; the recognition module 1203 is configured to recognize the speech to be recognized based on a recognition model to obtain a recognition result, where the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.
It should be noted here that the acquiring module 1201 and the identifying module 1203 correspond to steps S402 to S404 in embodiment 1, and the two modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of the apparatus may be run in the computer terminal 10 provided in the first embodiment.
In an alternative, the identification module comprises: the device comprises a first processing module and a second processing module. The first processing module is used for processing the voice to be recognized to obtain the acoustic characteristics of the voice to be recognized; and the second processing module is used for processing the acoustic features in the decoder based on the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain a target statement corresponding to the voice to be recognized.
In an alternative, the second processing module includes: the device comprises a third processing module, a first obtaining module, a first determining module and a fourth processing module. The third processing module is used for processing the acoustic features based on the hybrid acoustic model to obtain a modeling unit corresponding to the acoustic features; the first acquisition module is used for acquiring a processing result of the acoustic feature processed by the modeling unit; a first determining module, configured to determine, in the decoder, a word corresponding to the processing result based on the mixed dictionary; and the fourth processing module is used for processing the words based on the mixed language model to obtain a recognition result.
In an alternative, the fourth processing module includes: a fifth processing module and a second determining module. The fifth processing module is used for processing the words based on the mixed language model to obtain a plurality of sentences corresponding to the speech to be recognized; and the second determining module is used for determining a target sentence corresponding to the voice to be recognized from the sentences based on the optimal path searching mode to obtain a recognition result.
In an alternative, the third processing module includes: the device comprises a first input module, a second acquisition module and a third determination module. The first input module is used for inputting the acoustic features into an acoustic model corresponding to each language in the mixed acoustic model; the second acquisition module is used for acquiring the probability of the acoustic model corresponding to each language outputting the modeling unit of the corresponding language; and the third determining module is used for determining the modeling unit corresponding to the acoustic feature according to the probability.
In an optional aspect, the speech recognition apparatus further includes: the device comprises a third acquisition module, an extraction module, a second input module and a first training module. The third obtaining module is used for obtaining voice data containing a plurality of languages; the extraction module is used for extracting acoustic features from the voice data; the second input module is used for inputting the acoustic features into the acoustic model corresponding to each language; and the first training module is used for training the mixed acoustic model based on the acoustic model corresponding to each language and the voice data to obtain the mixed acoustic model.
In an optional aspect, the speech recognition apparatus further includes: a third obtaining module, a second training module and a sixth processing module. The third obtaining module is used for obtaining text data corresponding to each language; the second training module is used for training the language model of each language based on the text data corresponding to that language; and the sixth processing module is used for performing interpolation processing on the language models corresponding to the languages to obtain a mixed language model.
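Linear interpolation is the standard way to combine per-language language models: P_mix(w) = Σ_i λ_i · P_i(w), with the weights λ_i summing to 1. A minimal unigram sketch follows; the weights and toy vocabularies are illustrative assumptions.

    from typing import Dict, List

    def interpolate_language_models(models: List[Dict[str, float]],
                                    weights: List[float]) -> Dict[str, float]:
        # P_mix(w) = sum_i weights[i] * P_i(w)
        mixed: Dict[str, float] = {}
        for model, weight in zip(models, weights):
            for word, prob in model.items():
                mixed[word] = mixed.get(word, 0.0) + weight * prob
        return mixed

    zh = {"你好": 0.02, "谢谢": 0.01}
    en = {"hello": 0.03, "thanks": 0.01}
    mixed_lm = interpolate_language_models([zh, en], [0.7, 0.3])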
In an optional aspect, the speech recognition apparatus further includes: a fourth obtaining module and a generating module. The fourth obtaining module is used for obtaining a recognition result, wherein the recognition result represents the result of recognizing the speech to be recognized as text; and the generating module is used for generating feedback information corresponding to the recognition result.
Example 5
An embodiment of the present application further provides a computer terminal, which may be any computer terminal device in a group of computer terminals. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute program code for the following steps of the speech recognition method: acquiring a speech to be recognized, wherein the speech to be recognized is voice data containing at least one language; and recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.
Optionally, Fig. 13 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in Fig. 13, the computer terminal A may include: one or more processors 1302 (only one of which is shown), a memory 1304, and a transmission device 1306.
The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the speech recognition method and apparatus in the embodiments of the present application; the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the speech recognition method. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located from the processor; such remote memories may be connected to the computer terminal A through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Through the transmission device, the processor can call the information and application programs stored in the memory to execute the following steps: acquiring a speech to be recognized, wherein the speech to be recognized is voice data containing at least one language; and recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.
Optionally, the processor may further execute the program code of the following steps: processing the voice to be recognized to obtain acoustic characteristics of the voice to be recognized; and processing the acoustic features in a decoder based on the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain a target sentence corresponding to the voice to be recognized.
Optionally, the processor may further execute the program code of the following steps: processing the acoustic features based on the mixed acoustic model to obtain a modeling unit corresponding to the acoustic features; acquiring a processing result of the acoustic feature processed by the modeling unit; determining, in the decoder, a word corresponding to the processing result based on the mixed dictionary; and processing the words based on the mixed language model to obtain a recognition result.
Optionally, the processor may further execute the program code of the following steps: processing the words based on the mixed language model to obtain a plurality of sentences corresponding to the speech to be recognized; and determining a target sentence corresponding to the voice to be recognized from the sentences based on the optimal path searching mode to obtain a recognition result.
Optionally, the processor may further execute the program code of the following steps: inputting the acoustic features into an acoustic model corresponding to each language in the mixed acoustic model; acquiring the probability of the acoustic model corresponding to each language outputting the modeling unit of the corresponding language; and determining a modeling unit corresponding to the acoustic feature according to the probability.
Optionally, the processor may further execute the program code of the following steps: acquiring voice data containing a plurality of languages; extracting acoustic features from the voice data; inputting the acoustic features into the acoustic model corresponding to each language; and performing training based on the acoustic model corresponding to each language and the voice data to obtain the mixed acoustic model.
Optionally, the processor may further execute the program code of the following steps: acquiring text data corresponding to each language; respectively training language models of corresponding languages based on the text data corresponding to each language; and carrying out interpolation processing on the language model corresponding to each language to obtain a mixed language model.
Optionally, the processor may further execute the program code of the following steps: acquiring a recognition result, wherein the recognition result represents the result of recognizing the speech to be recognized as text; and generating feedback information corresponding to the recognition result.
It can be understood by those skilled in the art that the structure shown in Fig. 13 is only illustrative, and the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 13 does not limit the structure of the above electronic device; for example, the computer terminal A may include more or fewer components (e.g., a network interface, a display device, etc.) than shown in Fig. 13, or have a configuration different from that shown in Fig. 13.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Example 6
Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store the program code for executing the speech recognition method provided in the foregoing embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a speech to be recognized, wherein the speech to be recognized is voice data containing at least one language; and recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of multiple languages, the mixed language model comprises language models of the multiple languages, and the mixed dictionary comprises dictionaries of the multiple languages.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: processing the voice to be recognized to obtain acoustic characteristics of the voice to be recognized; and processing the acoustic features in a decoder based on the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain a target sentence corresponding to the voice to be recognized.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: processing the acoustic features based on the mixed acoustic model to obtain a modeling unit corresponding to the acoustic features; acquiring a processing result of the acoustic feature processed by the modeling unit; determining, in the decoder, a word corresponding to the processing result based on the mixed dictionary; and processing the words based on the mixed language model to obtain a recognition result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: processing the words based on the mixed language model to obtain a plurality of sentences corresponding to the speech to be recognized; and determining a target sentence corresponding to the voice to be recognized from the sentences based on the optimal path searching mode to obtain a recognition result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: inputting the acoustic features into an acoustic model corresponding to each language in the mixed acoustic model; acquiring the probability of the acoustic model corresponding to each language outputting the modeling unit of the corresponding language; and determining a modeling unit corresponding to the acoustic feature according to the probability.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring voice data containing a plurality of languages; extracting acoustic features from the voice data; inputting the acoustic features into the acoustic model corresponding to each language; and performing training based on the acoustic model corresponding to each language and the voice data to obtain the mixed acoustic model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring text data corresponding to each language; training the language model of each language based on the text data corresponding to that language; and performing interpolation processing on the language models corresponding to the languages to obtain a mixed language model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a recognition result, wherein the recognition result represents the result of recognizing the speech to be recognized as text; and generating feedback information corresponding to the recognition result.
The serial numbers of the above embodiments of the present application are merely for description and do not imply the relative merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also fall within the protection scope of the present application.

Claims (16)

1. A speech recognition method, comprising:
acquiring a voice to be recognized, wherein the voice to be recognized is voice data containing at least one language;
recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of a plurality of languages, the mixed language model comprises language models of the plurality of languages, and the mixed dictionary comprises dictionaries of the plurality of languages.
2. The method of claim 1, wherein recognizing the speech to be recognized based on a recognition model to obtain a recognition result comprises:
processing the voice to be recognized to obtain acoustic characteristics of the voice to be recognized;
and processing the acoustic features in a decoder based on the hybrid acoustic model, the hybrid language model and the hybrid dictionary to obtain a target sentence corresponding to the voice to be recognized.
3. The method of claim 2, wherein processing the acoustic features in the decoder based on the hybrid acoustic model, the hybrid language model, and the hybrid dictionary to obtain a target sentence corresponding to the speech to be recognized comprises:
processing the acoustic features based on the hybrid acoustic model to obtain a modeling unit corresponding to the acoustic features;
acquiring a processing result of the acoustic feature processed by the modeling unit;
determining, in the decoder, a word to which the processing result corresponds based on the hybrid dictionary;
and processing the words based on the mixed language model to obtain the recognition result.
4. The method of claim 3, wherein processing the words based on the hybrid language model to obtain the recognition result comprises:
processing the words based on the mixed language model to obtain a plurality of sentences corresponding to the speech to be recognized;
and determining a target sentence corresponding to the voice to be recognized from the sentences based on an optimal path searching mode to obtain the recognition result.
5. The method of claim 3, wherein processing the acoustic features based on the hybrid acoustic model to obtain a modeling unit corresponding to the acoustic features comprises:
inputting the acoustic features into an acoustic model corresponding to each language in the mixed acoustic model;
obtaining the probability of the acoustic model corresponding to each language outputting the modeling unit of the corresponding language;
and determining a modeling unit corresponding to the acoustic feature according to the probability.
6. The method of claim 1, further comprising:
acquiring voice data containing a plurality of languages;
extracting acoustic features from the speech data;
inputting the acoustic features into an acoustic model corresponding to each language;
and performing training based on the acoustic model corresponding to each language and the voice data to obtain the mixed acoustic model.
7. The method of claim 1, further comprising:
acquiring text data corresponding to each language;
respectively training language models of corresponding languages based on the text data corresponding to each language;
and carrying out interpolation processing on the language model corresponding to each language to obtain the mixed language model.
8. The method of claim 1, further comprising:
acquiring the recognition result, wherein the recognition result represents the result of recognizing the speech to be recognized as text;
and generating feedback information corresponding to the recognition result.
9. A speech recognition method, comprising:
inputting a voice to be recognized, wherein the voice to be recognized is voice data containing at least one language;
outputting feedback information corresponding to a recognition result of the speech to be recognized, wherein the recognition result is obtained by recognizing the speech to be recognized with a recognition model, and the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of a plurality of languages, the mixed language model comprises language models of the plurality of languages, and the mixed dictionary comprises dictionaries of the plurality of languages.
10. The method according to claim 9, wherein the speech to be recognized further comprises speech data of different dialects of the same language.
11. The method of claim 9, wherein the feedback information comprises at least one of: voice information, text information, picture information, and video information.
12. A speech recognition system, comprising:
an input unit, used for acquiring a speech to be recognized, wherein the speech to be recognized is voice data containing at least one language;
a recognition unit, used for recognizing the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of a plurality of languages, the mixed language model comprises language models of the plurality of languages, and the mixed dictionary comprises dictionaries of the plurality of languages;
and the output unit is used for outputting the feedback information corresponding to the identification result.
13. The system of claim 12, further comprising:
the voice recognition device comprises a preprocessing unit and a recognition unit, wherein the preprocessing unit is used for carrying out endpoint detection on the voice to be recognized to obtain first voice, then carrying out noise reduction processing on the first voice to obtain second voice, carrying out feature extraction on the second voice to obtain acoustic features of the voice to be recognized, and the recognition unit recognizes the voice to be recognized according to the acoustic features.
14. A speech recognition apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a voice to be recognized, and the voice to be recognized is voice data containing at least one language;
a recognition module, configured to recognize the speech to be recognized based on a recognition model to obtain a recognition result, wherein the recognition model at least comprises a mixed acoustic model, a mixed language model and a mixed dictionary, wherein the mixed acoustic model comprises acoustic models of a plurality of languages, the mixed language model comprises language models of the plurality of languages, and the mixed dictionary comprises dictionaries of the plurality of languages.
15. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the speech recognition method according to any one of claims 1 to 8.
16. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to perform the speech recognition method according to any one of claims 1 to 8 when running.
CN201910376604.9A 2019-05-07 2019-05-07 Voice recognition method, device and system Pending CN111916062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910376604.9A CN111916062A (en) 2019-05-07 2019-05-07 Voice recognition method, device and system

Publications (1)

Publication Number Publication Date
CN111916062A true CN111916062A (en) 2020-11-10

Family

ID=73241878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910376604.9A Pending CN111916062A (en) 2019-05-07 2019-05-07 Voice recognition method, device and system

Country Status (1)

Country Link
CN (1) CN111916062A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101447184A (en) * 2007-11-28 2009-06-03 中国科学院声学研究所 Chinese-English bilingual speech recognition method based on phoneme confusion
CN107195296A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of audio recognition method, device, terminal and system
CN108510976A (en) * 2017-02-24 2018-09-07 芋头科技(杭州)有限公司 A kind of multilingual mixing voice recognition methods
CN109036384A (en) * 2018-09-06 2018-12-18 百度在线网络技术(北京)有限公司 Audio recognition method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077781A (en) * 2021-06-04 2021-07-06 北京世纪好未来教育科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113077781B (en) * 2021-06-04 2021-09-07 北京世纪好未来教育科技有限公司 Voice recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination