CN112185348A - Multilingual voice recognition method and device and electronic equipment - Google Patents

Multilingual voice recognition method and device and electronic equipment Download PDF

Info

Publication number
CN112185348A
Authority
CN
China
Prior art keywords: recognition result, target, probability, voice, recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011119841.6A
Other languages
Chinese (zh)
Other versions
CN112185348B (en)
Inventor
刘博卿
王健宗
张之勇
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011119841.6A priority Critical patent/CN112185348B/en
Priority to PCT/CN2020/134543 priority patent/WO2021179701A1/en
Publication of CN112185348A publication Critical patent/CN112185348A/en
Application granted granted Critical
Publication of CN112185348B publication Critical patent/CN112185348B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to the technical field of voice recognition, and discloses a multilingual voice recognition method and apparatus and an electronic device, wherein the multilingual voice recognition method comprises the following steps: acquiring a target voice to be recognized; calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target voice, and acquiring a recognition result search grid of the target voice; calling a plurality of pre-trained monolingual language models to re-score the recognition result search grid, each screening out a candidate recognition result in its corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target voice; and sorting the candidate recognition results in descending order of probability, and screening the target recognition result from the top-ranked preset number of candidate recognition results. The method and apparatus can reduce the difficulty of multilingual voice recognition. The scheme can likewise be applied to the online consultation step in digital healthcare.

Description

Multilingual voice recognition method and device and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a multilingual speech recognition method and apparatus, and an electronic device.
Background
With increasingly frequent international cultural exchange, users often mix several languages together in everyday voice communication. For example, the question "Did you see my pen?" may be asked mostly in Chinese with the English word "pen" embedded in it, mixing Chinese and English in a single utterance. In the prior art, when recognizing multilingual speech, a language detection model is first used to detect the language types contained in the speech, and speech recognition is then performed based on the detected language types to obtain a recognition result. The accuracy of multilingual speech recognition in the prior art is therefore limited by the accuracy of the language detection model: once the language detection model errs, the speech recognition result is also wrong. Consequently, to guarantee the accuracy of the speech recognition result, the prior art must train a high-accuracy language detection model, which places heavy demands on training data and training time and is difficult.
Disclosure of Invention
The present disclosure provides a multilingual speech recognition method, device and electronic device, which mainly aims to reduce the difficulty of multilingual speech recognition.
In order to achieve the above object, the present disclosure provides a multilingual speech recognition method, including:
acquiring target voice to be recognized;
calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target voice, and acquiring a recognition result search grid of the target voice;
calling a plurality of pre-trained monolingual language models to re-score the recognition result search grid, each screening out a candidate recognition result in its corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target voice;
and sorting the candidate recognition results in descending order of probability, and screening the target recognition result from the top-ranked preset number of candidate recognition results.
Optionally, before invoking the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the method further comprises:
performing noise reduction processing on the target voice to obtain noise-reduced target voice;
and performing feature extraction on the target voice subjected to the noise reduction processing to obtain a voice frame sequence of the target voice which is used as the input of the acoustic model.
Optionally, the multilingual language model is pre-trained by a method comprising:
acquiring a first training text corresponding to a first language and a second training text corresponding to a second language;
inputting the first training text and the second training text into the multi-language model together to obtain a first recognition result of the first training text and a second recognition result of the second training text output by the multi-language model;
determining a recognition error of the multilingual language model based on the first recognition result and the second recognition result;
and adjusting the parameters of the multilingual language model by back-propagating the recognition error until the recognition error is less than a preset error threshold.
Optionally, invoking the pre-trained acoustic model and the pre-trained multilingual language model to decode the target voice includes:
calling the acoustic model, taking the voice frames of the target voice as input and, for each voice frame, outputting a first probability of the voice frame corresponding to each state and a second probability of transition between the states;
acquiring a third probability, obtained from the pre-training of the multilingual language model, for describing word-order statistical regularities;
and decoding the target voice based on the first probability, the second probability and the third probability to obtain the recognition result search grid.
Optionally, decoding the target voice based on the first probability, the second probability and the third probability to obtain the recognition result search grid includes:
establishing a state network of the target voice according to the first probability, the second probability and the third probability, wherein nodes in the state network describe voice frames in a single state, and edges in the state network describe the transition probabilities among the voice frames in their single states;
and searching in the state network by adopting a Viterbi algorithm to obtain the recognition result search grid.
Optionally, screening out the target recognition result from the candidate recognition results based on the probabilities includes: determining the candidate recognition result corresponding to the maximum probability as the target recognition result.
Optionally, screening out the target recognition result from the candidate recognition results based on the probabilities includes:
sorting the candidate recognition results in descending order of probability;
and randomly selecting one candidate recognition result from the top-ranked preset number of candidate recognition results as the target recognition result.
In order to solve the above problem, the present disclosure also provides a multilingual speech recognition apparatus, including:
the acquisition module is configured to acquire target voice to be recognized;
the decoding module is configured to call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target voice, and obtain a recognition result search grid of the target voice;
a re-scoring module configured to call a plurality of pre-trained monolingual language models to re-score the recognition result search grid, each screening out a candidate recognition result in its corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target voice;
and the screening module is configured to sort the candidate recognition results in descending order of probability, and screen the target recognition result from the top-ranked preset number of candidate recognition results.
In order to solve the above problem, the present disclosure also provides an electronic device, including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the multilingual speech recognition method described above.
In order to solve the above problem, the present disclosure also provides a computer-readable storage medium having at least one instruction stored therein, where the at least one instruction is executed by a processor in an electronic device to implement the multilingual speech recognition method.
The embodiment of the disclosure decodes the target voice through the acoustic model and the multilingual language model to obtain a recognition result search grid containing recognition results for the various possible language combinations. The recognition result search grid is then re-scored by the monolingual language models to screen out the most likely recognition result in each language, i.e. the candidate recognition results, and the target recognition result is finally screened out from the multiple candidate recognition results thus obtained. The method provided by the embodiment of the disclosure avoids the language-type detection step in multilingual speech recognition and thereby the difficulty of training a language detection model, reducing the difficulty of multilingual speech recognition. It is applicable to the field of intelligent government affairs, the field of intelligent medical treatment, or other fields with speech recognition requirements, and promotes the construction of smart cities. The scheme can likewise be applied to the online consultation step in digital healthcare.
Drawings
Fig. 1 is a flowchart illustrating a multilingual speech recognition method according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of a multilingual speech recognition apparatus according to an embodiment of the present disclosure.
Fig. 3 is a schematic internal structural diagram of an electronic device implementing a multilingual speech recognition method according to an embodiment of the present disclosure.
The objects, features, and advantages of the present disclosure will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the disclosure and are not intended to limit the disclosure.
The present disclosure provides a multilingual speech recognition method. Referring to fig. 1, a flowchart of a multilingual speech recognition method according to an embodiment of the present disclosure is shown. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the multilingual speech recognition method includes:
step S1, obtaining target voice to be recognized;
step S2, calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target voice, and acquiring a recognition result search grid of the target voice;
step S3, calling a plurality of pre-trained monolingual language models to re-score the recognition result search grid, each screening out a candidate recognition result in its corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target voice;
step S4, sorting the candidate recognition results in descending order of probability, and screening the target recognition result from the top-ranked preset number of candidate recognition results.
The multilingual speech recognition method provided by the embodiment of the disclosure can be applied to various application scenarios with speech recognition requirements, for example: intelligent customer service and intelligent robot inquiry.
In an embodiment, the multilingual speech recognition method provided by the embodiment of the present disclosure is applied to intelligent customer service. The intelligent customer service system trains and deploys an acoustic model, a multilingual language model, a Chinese language model, an English language model and a Japanese language model on a cloud server in advance. The multilingual language model is trained on text mixing Chinese, English and Japanese; the Chinese language model is trained only on Chinese text; the English language model is trained only on English text; and the Japanese language model is trained only on Japanese text.
While a user is on a call with the intelligent customer service system through a client (such as a mobile phone or a personal computer), the system uploads the user voice collected by the client to the cloud server, and calls the acoustic model and the multilingual language model to decode the user voice into a recognition result search grid. The Chinese language model is called to re-score the recognition result search grid, screening out the Chinese candidate recognition result R1 according to the re-scoring and determining the probability P1 that R1 is the user's true semantics; in parallel, the English language model is called to re-score the search grid, screening out the English candidate recognition result R2 and determining the probability P2 that R2 is the user's true semantics; and in parallel, the Japanese language model is called to re-score the search grid, screening out the Japanese candidate recognition result R3 and determining the probability P3 that R3 is the user's true semantics.
The intelligent customer service system sorts the three candidate recognition results in descending order of the probabilities P1, P2 and P3, screens the target recognition result from the two top-ranked candidates, and thereby determines the user's true semantics.
If the intelligent customer service system determines that candidate recognition result R2 is the user's true semantics, it may further determine a response to R2 according to a preset natural language processing model, convert the response from text to speech, and feed the speech back to the user. Thus, even if the user mixes English or Japanese into Chinese during the conversation, the intelligent customer service system can accurately understand the user's true semantics and respond accurately.
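For illustration, the decode-then-re-score flow above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the lattice object and the per-language re-scoring callables are hypothetical stand-ins, not an API defined by the patent.

```python
def recognize(lattice, monolingual_lms, top_m=2):
    """Re-score one recognition result search grid with each monolingual
    language model, then screen within the top-M candidates."""
    candidates = []
    for lang, rescore in monolingual_lms.items():
        result, prob = rescore(lattice)   # best candidate in `lang` and its probability
        candidates.append((lang, result, prob))
    candidates.sort(key=lambda c: c[2], reverse=True)  # descending probability
    return candidates[:top_m]

# Toy stand-ins for the three re-scoring models (real ones search the lattice):
lms = {
    "zh": lambda lat: ("R1", 0.20),
    "en": lambda lat: ("R2", 0.55),
    "ja": lambda lat: ("R3", 0.25),
}
print(recognize("lattice-placeholder", lms))  # [('en', 'R2', 0.55), ('ja', 'R3', 0.25)]
```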
In the embodiment of the disclosure, the target voice to be recognized is first acquired. Specifically, the user's chat voice may be collected through a microphone and taken as the target voice to be recognized; the semantics of the chat voice are then determined through the subsequent processing of the embodiment of the disclosure, yielding the target recognition result of the target voice. Further, based on the obtained target recognition result, the chat voice may be translated into speech of a specific language or converted into text.
The acquired target voice may include a plurality of languages. For example: the target voice includes Chinese and English; or Chinese and Japanese; or Chinese, English and Japanese.
In the disclosed embodiment, at least one acoustic model and at least one multilingual language model are pre-trained. After the target voice is obtained, the acoustic model and the multilingual language model are called to decode the target voice and obtain its recognition result search grid. The recognition result search grid can be regarded as the search space from which the correct target recognition result is retrieved: a compact data structure containing multiple selectable paths. Specifically, when the acoustic model and the multilingual language model are called to decode the target voice, a search is performed in the state network of the target voice, all searched paths are scored, and the state network is then pruned and screened according to the scores, yielding a more compact recognition result search grid whose retained paths are closer to the target recognition result.
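As an illustration of such a structure, below is a minimal sketch of a recognition result search grid as a weighted directed acyclic graph in Python; the class and field names are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class LatticeNode:
    word: str           # hypothesised word (or finer-grained state label)
    start_frame: int    # first voice frame the hypothesis spans
    end_frame: int      # last voice frame the hypothesis spans

@dataclass
class Lattice:
    """Compact DAG: every path from entry to exit is one selectable recognition result."""
    nodes: list = field(default_factory=list)   # LatticeNode objects
    edges: dict = field(default_factory=dict)   # (src_idx, dst_idx) -> score

    def add_edge(self, src: int, dst: int, score: float) -> None:
        # The edge score combines acoustic and language-model probabilities.
        self.edges[(src, dst)] = score
```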
The acoustic model is a model for processing raw audio data and extracting phoneme information from the audio data. Specifically, a Gaussian mixture model combined with a hidden Markov model (GMM-HMM) may be used as the acoustic model in the embodiments of the present disclosure; a deep neural network (DNN) may also be used as the acoustic model.
The multilingual language model refers to a language model whose training text contains a plurality of different languages. Because training is confined to the limited number of languages and the limited set of sentences contained in the training text, when the multilingual language model is put into actual speech recognition, the recognition result search grid obtained by decoding with it has a high perplexity.
In one embodiment, before invoking the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the method further comprises:
carrying out noise reduction processing on the target voice to obtain the target voice subjected to noise reduction processing;
and performing feature extraction on the target voice subjected to the noise reduction processing to obtain a voice frame sequence of the target voice used as the input of the acoustic model.
In this embodiment, after the target voice is acquired and before it is decoded, noise reduction processing is performed on the target voice to enhance it. Features are then extracted from the noise-reduced target voice, converting it from a time-domain signal to a frequency-domain signal to obtain the voice frame sequence of the target voice (for example, the frame-wise FBANK features of the target voice) used as the input of the acoustic model.
It will be appreciated that, in addition to noise reduction before decoding, distortion cancellation or other speech pre-processing may also be applied to the target voice to further improve the overall accuracy of multilingual speech recognition.
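For concreteness, here is a minimal pre-processing sketch using torchaudio, assuming 16 kHz mono input; the high-pass filter is only a stand-in for whatever noise-reduction method is actually deployed, and the file name is illustrative.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi
import torchaudio.functional as F

# Load the target voice (file name is illustrative).
waveform, sample_rate = torchaudio.load("target_voice.wav")

# Stand-in noise reduction: a high-pass filter that removes low-frequency hum;
# a production system might use spectral subtraction or a learned enhancer.
denoised = F.highpass_biquad(waveform, sample_rate, cutoff_freq=100.0)

# Time domain to frequency domain: 80-dimensional FBANK features per 25 ms
# frame with a 10 ms shift, i.e. the voice frame sequence fed to the acoustic model.
fbank = kaldi.fbank(
    denoised,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
    sample_frequency=sample_rate,
)
print(fbank.shape)  # (num_frames, 80)
```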
In one embodiment, the multilingual language model is pre-trained by a method comprising:
acquiring a first training text corresponding to a first language and a second training text corresponding to a second language;
inputting the first training text and the second training text into the multi-language model together to obtain a first recognition result of the first training text and a second recognition result of the second training text output by the multi-language model;
determining a recognition error of the multilingual language model based on the first recognition result and the second recognition result;
and adjusting the parameters of the multilingual language model by back-propagating the recognition error until the recognition error is smaller than a preset error threshold.
In this embodiment, a multilingual language model is pre-trained using a training text containing two languages.
Specifically, the first training text is in a first language, the second training text is in a second language, and the first and second languages are different. On the premise of keeping each sentence semantically complete, the first training text and the second training text are mixed together and input into the multilingual language model; the multilingual language model then processes the input text according to its current parameters and outputs a first recognition result of the first training text and a second recognition result of the second training text. At this stage, either output may be erroneous.
Since the correct recognition results of the first training text and of the second training text are both determined in advance, the recognition error of the multilingual language model can be determined from the first and second recognition results it outputs. The parameters of the multilingual language model are then adjusted from back to front using a back-propagation algorithm, the first and second training texts are mixed together and input again, and the loop continues until the recognition error of the multilingual language model is smaller than the preset error threshold.
For example: the preset error threshold is 5%. A certain amount of Chinese training text and a certain amount of English training text are collected in advance, and each sentence of the Chinese training text is randomly inserted, sentence by sentence, into the English training text to obtain a mixed training text combining Chinese and English.
The mixed training text is input into the language model A to be trained to obtain a first recognition result of the Chinese part and a second recognition result of the English part output by language model A, from which the recognition error of language model A is determined. The parameters of language model A are then adjusted by back-propagating the recognition error.
The mixed training text is input into the adjusted language model A, and the loop continues until the recognition error of language model A is less than 5%, yielding the pre-trained multilingual language model A, which can recognize mixed Chinese-English text with a certain accuracy.
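A minimal sketch of this training loop in PyTorch follows; the tiny LSTM language model and the mixed_batches loader are illustrative assumptions, since the patent does not fix an architecture or framework.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Illustrative next-token language model over a mixed Chinese/English vocabulary."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)  # (batch, seq, vocab)

def pretrain(model, mixed_batches, threshold=0.05, lr=1e-3):
    """Loop until the token-level recognition error drops below the preset threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    error = 1.0
    while error > threshold:
        wrong, total = 0, 0
        for inputs, targets in mixed_batches():  # mixed zh/en text as (input, target) id tensors
            logits = model(inputs)
            loss = criterion(logits.transpose(1, 2), targets)
            optimizer.zero_grad()
            loss.backward()                      # back-propagate the recognition error
            optimizer.step()
            wrong += (logits.argmax(-1) != targets).sum().item()
            total += targets.numel()
        error = wrong / total                    # recognition error over the mixed text
    return model
```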
It should be noted that this embodiment is only an exemplary illustration of the process of pre-training the multilingual language model. It can be understood that, according to the requirements of the application, three or more than three languages of training texts can be mixed together when the multilingual language model is pre-trained. This example should not be construed as limiting the scope of the disclosure in its function and use.
In one embodiment, invoking the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech includes:
calling the acoustic model, taking the voice frames of the target voice as input and, for each voice frame, outputting a first probability of the voice frame corresponding to each state and a second probability of transition between the states;
acquiring a third probability, obtained from the pre-training of the multilingual language model, for describing word-order statistical regularities;
and decoding the target voice based on the first probability, the second probability and the third probability to obtain a recognition result search grid.
In this embodiment, decoding mainly refers to searching a search space composed of knowledge sources such as the acoustic model, the acoustic context and the language model to obtain the probability corresponding to each path, and then selecting the optimal path, or the top-N paths, according to these probabilities. Each path corresponds to a recognition result; N is a preset positive integer.
Specifically, the acoustic model integrates acoustic and pronunciation knowledge, takes the voice frame sequence of the target voice as input and, for each voice frame in the sequence, outputs a first probability of the frame corresponding to each state and a second probability of transition between states. That is, the first probability describes the relation between a frame and a state, and the second probability describes the relation between states. Here a state is a speech unit of finer granularity than a phoneme, and one phoneme corresponds to several states. The essence of speech recognition is to determine the state corresponding to each voice frame, from the states the phonemes, and from the phonemes the word sequence; the determined word sequence is the recognition result.
After pre-training, the multilingual language model holds prior knowledge of the word-sequence statistical regularities matching its pre-training corpus, and from this prior knowledge it outputs the third probability describing the word-order statistical regularities.
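As a toy illustration of such a word-order prior, the third probability can be thought of as an n-gram estimate from counts; the one-sentence corpus below is made up.

```python
from collections import Counter

def bigram_prior(corpus):
    """Estimate P(w2 | w1) from counts: a minimal word-order statistical rule."""
    starts, bigrams = Counter(), Counter()
    for sentence in corpus:
        starts.update(sentence[:-1])               # words that can start a bigram
        bigrams.update(zip(sentence, sentence[1:]))
    return {(w1, w2): c / starts[w1] for (w1, w2), c in bigrams.items()}

prior = bigram_prior([["do", "you", "see", "my", "pen"]])
print(prior[("you", "see")])  # 1.0 in this toy corpus
```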
The target voice is thus decoded based on the first probability, the second probability and the third probability to obtain the recognition result search grid.
In an embodiment, decoding the target voice based on the first probability, the second probability and the third probability to obtain the recognition result search grid includes:
establishing a state network of the target voice according to the first probability, the second probability and the third probability, wherein nodes in the state network describe voice frames in a single state, and edges in the state network describe the transition probabilities among the voice frames in their single states;
and searching in the state network by adopting a Viterbi algorithm to obtain the recognition result search grid.
In this embodiment, the state network of the target voice is built as graph-structured data. Specifically, each voice frame is expanded across the states according to the first probability to obtain the corresponding nodes in the state network. For example: the first probability that voice frame t is in the first state s1 is a1, the first probability that it is in the second state s2 is a2, and the first probability that it is in the third state s3 is a3. Expanding voice frame t across these states yields 3 nodes in the state network: t-s1, t-s2 and t-s3.
Edges with specific weights are then established among the nodes according to the second probability of transition between states and the third probability describing the word-order statistical regularities, thereby obtaining the state network.
A search is performed in the state network using the Viterbi algorithm, and the searched paths are scored. From the scores, the best path and sub-optimal paths are determined. The top-N paths are selected and stored in a DAG (directed acyclic graph) structure, giving the recognition result search grid as graph-structured data. Specifically, after the N paths are selected, all nodes they contain are selected, the original position of each node in the state network is restored, and edges with specific weights are re-established among the selected and restored nodes according to the first probability, the second probability and the third probability, thereby obtaining the recognition result search grid.
A Viterbi search mainly follows three rules: 1. If the maximum-probability path passes through a certain node of the state network, then the sub-path from the start to that node must also be the maximum-probability path from the start to that node. 2. If there are k states at time i, there are k shortest paths from the start to time i, and the final shortest path must pass through one of these states. 3. From the two points above, when computing the shortest path to a state at time i+1, only the shortest paths from the start to the k states at time i and the transitions from those states to the state at time i+1 need be considered.
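The following is a minimal log-domain Viterbi sketch over such a state network; emit stands for the first probability (already combined with the language-model prior) and trans for the second probability, both illustrative inputs.

```python
import numpy as np

def viterbi(emit: np.ndarray, trans: np.ndarray):
    """emit: (T frames, S states) log-scores; trans: (S, S) log transition scores."""
    T, S = emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = emit[0]
    for t in range(1, T):
        for s in range(S):
            # Rules 1 and 3: the best path into (t, s) extends the best path
            # into one of the k states at time t-1; nothing else need be examined.
            prev = score[t - 1] + trans[:, s]
            back[t, s] = int(np.argmax(prev))
            score[t, s] = prev[back[t, s]] + emit[t, s]
    # Rule 2: the best full path ends in the best-scoring final state; trace it back.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score[-1].max())
```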
In the embodiment of the present disclosure, a plurality of monolingual language models are pre-trained, each performing speech recognition only for its corresponding language. After the recognition result search grid is obtained, the monolingual language models are called to re-score it. As described above, decoding the target voice means searching its state network, on the basis of the prior knowledge of the multilingual language model, to obtain the recognition result search grid; re-scoring by a monolingual language model can then be regarded as searching again, within the recognition result search grid, on the basis of that monolingual model's prior knowledge, to obtain the corresponding candidate recognition result. The implementation of re-scoring is similar to that of decoding; the main difference is that the prior knowledge of the multilingual language model describes word-order statistical regularities in an environment mixing several languages, whereas the prior knowledge of a monolingual language model describes word-order statistical regularities in a single specific language.
After re-scoring, each monolingual language model screens out the most likely recognition result in its corresponding language, i.e. the candidate recognition result in that language, and simultaneously determines the probability that this candidate recognition result is the target recognition result under that language.
For example: 3 monolingual language models are pre-trained: a language model B dedicated to Chinese speech recognition, a language model C dedicated to English speech recognition, and a language model D dedicated to Japanese speech recognition. After the recognition result search grid is obtained, language model B is called to re-score it, screening out the Chinese candidate recognition result R1 with the highest probability P1; language model C is called to re-score it, screening out the English candidate recognition result R2 with the highest probability P2; and language model D is called to re-score it, screening out the Japanese candidate recognition result R3 with the highest probability P3.
In one embodiment, the monolingual language model is a statistics-based n-gram language model.
The advantage of this embodiment is that adopting a statistics-based n-gram language model as the monolingual language model avoids the machine-resource limitations of using a neural network as the monolingual language model, increases processing speed, and improves the recognition efficiency of online real-time multilingual speech recognition.
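A minimal re-scoring sketch with KenLM, one common statistics-based n-gram toolkit, is shown below; the ARPA file names are illustrative, and the candidate paths are assumed to already be word sequences read off the recognition result search grid.

```python
import kenlm  # pip install kenlm

# Pre-built n-gram models, one per language (file names are illustrative).
lms = {
    "zh": kenlm.Model("zh.arpa"),
    "en": kenlm.Model("en.arpa"),
    "ja": kenlm.Model("ja.arpa"),
}

def rescore(paths, lang):
    """Return the (log10 probability, word sequence) pair that scores best under one language."""
    scored = [
        (lms[lang].score(" ".join(words), bos=True, eos=True), words)
        for words in paths
    ]
    return max(scored)  # highest log-probability candidate for this language
```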
In the embodiment of the disclosure, after the multiple candidate recognition results and the corresponding probabilities that they are the target recognition result are obtained, the candidate recognition results are sorted in descending order of the corresponding probabilities, and the target recognition result is then screened from the top-ranked preset number of candidates. The preset number of top ranks is generally chosen so as not to include the last-ranked candidate.
In an embodiment, screening the target recognition result from the top-ranked preset number of candidate recognition results includes: determining the first-ranked candidate recognition result as the target recognition result.
In this embodiment, after the candidate recognition results are sorted in descending order of probability, the first candidate among the top M is determined as the target recognition result, i.e. the candidate recognition result corresponding to the highest probability is determined as the target recognition result. M is a preset positive integer greater than 1.
For example: M is preset to 2. The candidate recognition results include R1 with probability P1, R2 with probability P2, and R3 with probability P3, where P1 is less than P2 and P1 is greater than P3, so the top 2 candidates are R2 and R1. Candidate recognition result R2 is determined as the target recognition result.
In an embodiment, screening the target recognition result from the top-ranked preset number of candidate recognition results includes: randomly selecting one candidate recognition result from the top-ranked preset number of candidates as the target recognition result.
In this embodiment, after the candidate recognition results are sorted in descending order of probability, one candidate is randomly selected from the top M as the target recognition result. M is a preset positive integer greater than 1.
For example: M is preset to 3. The candidate recognition results include R1 with probability P1, R2 with probability P2, R3 with probability P3, and R4 with probability P4, where P1 is greater than P2, P2 is greater than P3, and P3 is greater than P4, so the top 3 candidates are R1, R2 and R3. One of R1, R2 and R3 is randomly selected as the target recognition result.
The advantage of this embodiment is that, by introducing a controlled degree of randomness, it avoids repeatedly producing the same kind of deviation in scenarios that the multilingual speech recognition system, while still imperfect, does not yet cover.
Fig. 2 is a functional block diagram of the multilingual speech recognition apparatus according to the present disclosure.
The multilingual speech recognition apparatus 100 of the present disclosure may be installed in an electronic device. Depending on the implemented functions, the multilingual speech recognition apparatus may include an acquisition module 101, a decoding module 102, a re-scoring module 103 and a screening module 104. A module of the present disclosure, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the obtaining module 101 is configured to obtain a target voice to be recognized;
the decoding module 102 is configured to call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target voice, and obtain a recognition result search grid of the target voice;
the re-scoring module 103 is configured to call a plurality of pre-trained monolingual language models to re-score the recognition result search grid, each screening out a candidate recognition result in its corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target voice;
the screening module 104 is configured to sort the candidate recognition results in descending order of probability, and screen the target recognition result from the top-ranked preset number of candidate recognition results. The functions implemented by the modules of the multilingual speech recognition apparatus 100 can refer to the description of the relevant steps in the embodiment corresponding to fig. 1, and are not repeated here.
Fig. 3 is a schematic structural diagram of an electronic device implementing the multilingual speech recognition method according to the present disclosure.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a multilingual speech recognition program 12, stored in the memory 11 and operable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the multilingual speech recognition program, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by operating or executing programs or modules (e.g., a multilingual speech recognition program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with certain components; it will be understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and the device may comprise fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The multilingual speech recognition program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, enable:
acquiring target voice to be recognized;
calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target voice, and acquiring a recognition result search grid of the target voice;
calling a plurality of pre-trained monolingual language models to re-score the recognition result search grid, each screening out a candidate recognition result in its corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target voice;
and sorting the candidate recognition results in descending order of probability, and screening the target recognition result from the top-ranked preset number of candidate recognition results.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the disclosure is not limited to the details of the foregoing illustrative embodiments, and that the present disclosure may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the disclosure being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or apparatuses recited in the system claims may also be implemented by one unit or apparatus in software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present disclosure and not for limiting, and although the present disclosure is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure.

Claims (10)

1. A multilingual speech recognition method, comprising:
acquiring target voice to be recognized;
calling a pre-trained acoustic model and a pre-trained multilingual language model to decode the target voice, and acquiring a recognition result search grid of the target voice;
calling a plurality of pre-trained monolingual language models to re-score the recognition result search grid, each screening out a candidate recognition result in its corresponding language, and respectively determining the probability that the candidate recognition result is the target recognition result of the target voice;
and sorting the candidate recognition results in descending order of probability, and screening the target recognition result from the top-ranked preset number of candidate recognition results.
2. The method of claim 1, wherein prior to invoking the pre-trained acoustic model and the pre-trained multilingual language model to decode the target speech, the method further comprises:
performing noise reduction processing on the target voice to obtain noise-reduced target voice;
and performing feature extraction on the target voice subjected to the noise reduction processing to obtain a voice frame sequence of the target voice which is used as the input of the acoustic model.
3. The method of claim 1, wherein the multilingual language model is pre-trained by a method comprising:
acquiring a first training text corresponding to a first language and a second training text corresponding to a second language;
inputting the first training text and the second training text into the multi-language model together to obtain a first recognition result of the first training text and a second recognition result of the second training text output by the multi-language model;
determining a recognition error of the multilingual language model based on the first recognition result and the second recognition result;
and adjusting the parameters of the multilingual language model by back-propagating the recognition error until the recognition error is less than a preset error threshold.
4. The method of claim 1, wherein invoking a pre-trained acoustic model and a pre-trained multilingual language model to decode the target speech comprises:
calling the acoustic model, taking the voice frame sequence of the target voice as input and, for each voice frame, outputting a first probability of the voice frame corresponding to each state and a second probability of transition between the states;
acquiring a third probability, obtained from the pre-training of the multilingual language model, for describing word-order statistical regularities;
and decoding the target voice based on the first probability, the second probability and the third probability to obtain the recognition result search grid.
5. The method of claim 4, wherein decoding the target voice based on the first probability, the second probability and the third probability to obtain the recognition result search grid comprises:
establishing a state network of the target voice according to the first probability, the second probability and the third probability, wherein nodes in the state network describe voice frames in a single state, and edges in the state network describe the transition probabilities among the voice frames in their single states;
and searching in the state network by adopting a Viterbi algorithm to obtain the recognition result search grid.
6. The method of claim 1, wherein screening the target recognition result from the top-ranked preset number of candidate recognition results comprises: determining the first-ranked candidate recognition result as the target recognition result.
7. The method of claim 1, wherein screening the target recognition result from the top-ranked preset number of candidate recognition results comprises: randomly selecting one candidate recognition result from the top-ranked preset number of candidate recognition results as the target recognition result.
8. A multilingual speech recognition apparatus, comprising:
the acquisition module is configured to acquire target voice to be recognized;
the decoding module is configured to call a pre-trained acoustic model and a pre-trained multilingual language model to decode the target voice, and obtain a recognition result search grid of the target voice;
a re-scoring module configured to call a plurality of pre-trained monolingual language models to re-score the recognition result search grid, each screening out a candidate recognition result in its corresponding language, and respectively determine the probability that the candidate recognition result is the target recognition result of the target voice;
and the screening module is configured to sort the candidate recognition results in descending order of probability, and screen the target recognition result from the top-ranked preset number of candidate recognition results.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a multilingual speech recognition method according to any one of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the multilingual speech recognition method of any one of claims 1 to 7.
CN202011119841.6A 2020-10-19 2020-10-19 Multilingual voice recognition method and device and electronic equipment Active CN112185348B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011119841.6A CN112185348B (en) 2020-10-19 2020-10-19 Multilingual voice recognition method and device and electronic equipment
PCT/CN2020/134543 WO2021179701A1 (en) 2020-10-19 2020-12-08 Multilingual speech recognition method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011119841.6A CN112185348B (en) 2020-10-19 2020-10-19 Multilingual voice recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112185348A (en) 2021-01-05
CN112185348B (en) 2024-05-03

Family

ID=73949726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011119841.6A Active CN112185348B (en) 2020-10-19 2020-10-19 Multilingual voice recognition method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN112185348B (en)
WO (1) WO2021179701A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132182B (en) * 2022-05-24 2024-02-23 腾讯科技(深圳)有限公司 Data identification method, device, equipment and readable storage medium
CN114927135B (en) * 2022-07-22 2022-12-13 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6615736B2 (en) * 2016-11-30 2019-12-04 日本電信電話株式会社 Spoken language identification apparatus, method thereof, and program
CN111369978B (en) * 2018-12-26 2024-05-17 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN109817213B (en) * 2019-03-11 2024-01-23 腾讯科技(深圳)有限公司 Method, device and equipment for performing voice recognition on self-adaptive language

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05281990A (en) * 1992-03-31 1993-10-29 Mitsubishi Electric Corp Similarity arithmetic unit
CN109117484A (en) * 2018-08-13 2019-01-01 北京帝派智能科技有限公司 A kind of voice translation method and speech translation apparatus
CN110895932A (en) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multi-language voice recognition method based on language type and voice content collaborative classification
CN110569830A (en) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 Multi-language text recognition method and device, computer equipment and storage medium
CN111667817A (en) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Voice recognition method, device, computer system and readable storage medium

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885336A (en) * 2021-01-29 2021-06-01 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system, and electronic equipment
CN112885336B (en) * 2021-01-29 2024-02-02 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system and electronic equipment
CN113077793A (en) * 2021-03-24 2021-07-06 北京儒博科技有限公司 Voice recognition method, device, equipment and storage medium
CN113077793B (en) * 2021-03-24 2023-06-13 北京如布科技有限公司 Voice recognition method, device, equipment and storage medium
CN113223506A (en) * 2021-05-28 2021-08-06 思必驰科技股份有限公司 Speech recognition model training method and speech recognition method
CN113223506B (en) * 2021-05-28 2022-05-20 思必驰科技股份有限公司 Speech recognition model training method and speech recognition method
CN113380224A (en) * 2021-06-04 2021-09-10 北京字跳网络技术有限公司 Language determination method and device, electronic equipment and storage medium
CN113597641A (en) * 2021-06-22 2021-11-02 华为技术有限公司 Voice processing method, device and system
CN113380227A (en) * 2021-07-23 2021-09-10 上海才历网络有限公司 Language identification method and device based on neural network and electronic equipment
CN113689888A (en) * 2021-07-30 2021-11-23 浙江大华技术股份有限公司 Abnormal sound classification method, system, device and storage medium
WO2023138286A1 (en) * 2022-01-19 2023-07-27 广州小鹏汽车科技有限公司 Multi-language recognition method and apparatus for speech, and terminal and storage medium
CN118136002A (en) * 2024-05-06 2024-06-04 证通股份有限公司 Method and equipment for constructing voice recognition model and method and equipment for voice recognition

Also Published As

Publication number Publication date
WO2021179701A1 (en) 2021-09-16
CN112185348B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN112185348B (en) Multilingual voice recognition method and device and electronic equipment
CN108287858B (en) Semantic extraction method and device for natural language
CN112001175B (en) Flow automation method, device, electronic equipment and storage medium
CN111192570B (en) Language model training method, system, mobile terminal and storage medium
CN109299471B (en) Text matching method, device and terminal
CN110738997B (en) Information correction method and device, electronic equipment and storage medium
CN114416943B (en) Training method and device for dialogue model, electronic equipment and storage medium
CN110991175B (en) Method, system, equipment and storage medium for generating text in multi-mode
CN110929520A (en) Non-named entity object extraction method and device, electronic equipment and storage medium
US20220139386A1 (en) System and method for chinese punctuation restoration using sub-character information
CN113190675A (en) Text abstract generation method and device, computer equipment and storage medium
CN112667775A (en) Keyword prompt-based retrieval method and device, electronic equipment and storage medium
Ge et al. THU_NGN at SemEval-2019 task 3: Dialog emotion classification using attentional LSTM-CNN
CN110020429B (en) Semantic recognition method and device
Azad et al. Picking pearl from seabed: Extracting artefacts from noisy issue triaging collaborative conversations for hybrid cloud services
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113255365A (en) Text data enhancement method, device and equipment and computer readable storage medium
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
CN112530402A (en) Voice synthesis method, voice synthesis device and intelligent equipment
CN111862963B (en) Voice wakeup method, device and equipment
CN108197100B (en) Emotion analysis method and device, computer readable storage medium and electronic equipment
CN113450805B (en) Automatic speech recognition method and device based on neural network and readable storage medium
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN114420168A (en) Emotion recognition method, device, equipment and storage medium
CN114186028A (en) Consult complaint work order processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant