CN115132175A - Voice recognition method and device, electronic equipment and computer readable storage medium


Info

Publication number
CN115132175A
CN115132175A
Authority
CN
China
Prior art keywords
recognition result
acquiring
voice
information
text unit
Prior art date
Legal status
Pending
Application number
CN202110736466.8A
Other languages
Chinese (zh)
Inventor
涂眉
张帆
巢望礼
徐小云
刘松
胡硕
文学
宋黎明
楼晓雁
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to PCT/KR2021/020378 (published as WO2022203167A1)
Priority to EP21933394.5A (published as EP4248441A1)
Priority to US17/592,956 (published as US20220310077A1)
Publication of CN115132175A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/12 Speech or voice analysis techniques in which the extracted parameters are prediction coefficients
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/69 Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The embodiment of the application provides a voice recognition method, a voice recognition device, electronic equipment and a computer-readable storage medium, wherein the method comprises the following steps: acquiring a first voice recognition result of a voice to be recognized; acquiring context information and pronunciation feature information of a target text unit in the first voice recognition result; and acquiring a second voice recognition result of the voice to be recognized based on the context information and the pronunciation feature information, wherein a plurality of steps in the scheme can be realized by artificial intelligence methods. When the method is used to correct errors in the voice recognition result, combining the context information and the pronunciation feature information of the target text unit allows the correction process to cover more error types, and the accuracy of the correction result is high.

Description

Voice recognition method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a speech recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
In automatic speech recognition (ASR), it is often necessary to correct errors in the recognition result in order to obtain a more accurate result. Existing speech recognition techniques that correct the recognized text face the problems that speech recognition errors are highly varied and difficult to cover completely, and that the corrected result is inaccurate, so the existing speech recognition methods need to be improved.
Disclosure of Invention
The purpose of this application is to address at least one of the above technical defects. The technical solutions provided by the embodiments of this application are as follows:
in a first aspect, an embodiment of the present application provides a speech recognition method, including:
acquiring a first voice recognition result of a voice to be recognized;
acquiring context information and pronunciation characteristic information of a target text unit in a first voice recognition result;
and acquiring a second voice recognition result of the voice to be recognized based on the context information and the pronunciation characteristic information.
In an optional embodiment of the present application, the method further comprises:
acquiring a confusion value of a text unit in a first voice recognition result;
based on the confusion value, a target text unit is determined.
In an optional embodiment of the present application, determining the target text unit based on the confusion value includes:
determining a first text unit whose confusion value is not less than a first preset threshold as the target text unit; or,
and acquiring a corresponding second text unit based on the first text unit and at least one text unit before and/or after the first text unit, and determining the second text unit as a target text unit.
In an optional embodiment of the present application, the obtaining context information of the target text unit in the first speech recognition result includes:
replacing the target text unit in the first voice recognition result with a preset mask to obtain a corresponding input text sequence;
and acquiring context information corresponding to the target text unit based on the input text sequence.
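As an illustration of the masking step above, here is a minimal sketch; it assumes the first speech recognition result is already split into text units, the target positions are already known, and a plain "[MASK]" string is used as the preset mask (all of these are assumptions for illustration, not the application's actual implementation):

    # Sketch only: replace target text units with a preset mask to build the input
    # text sequence from which the context information is later extracted.
    MASK = "[MASK]"  # assumed placeholder token

    def build_masked_sequence(text_units, target_positions):
        """Replace each target text unit with the preset mask."""
        return [MASK if i in target_positions else u for i, u in enumerate(text_units)]

    # Example: positions 2 and 4 are suspected to be wrong.
    units = ["please", "play", "dangerous", "by", "jackson"]
    print(build_masked_sequence(units, {2, 4}))
    # ['please', 'play', '[MASK]', 'by', '[MASK]']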
In an optional embodiment of the present application, the obtaining pronunciation feature information of the target text unit in the first speech recognition result includes:
acquiring a first phoneme set of a target text unit;
and acquiring pronunciation characteristic information of the target text unit based on the first phoneme set.
In an optional embodiment of the present application, the obtaining pronunciation feature information of the target text unit based on the first phoneme set includes:
using the phonemes in the first phoneme set as the pronunciation feature information of the target text unit; or,
and replacing at least one phoneme in the first phoneme set to obtain a second phoneme set, and taking the phoneme in the second phoneme set as pronunciation characteristic information of the target text unit.
In an optional embodiment of the present application, replacing at least one phoneme of the first phoneme set to obtain a second phoneme set includes:
acquiring at least one candidate replacement phoneme of the phonemes in the first phoneme set and the probability of the candidate replacement phoneme as the replacement phoneme of the corresponding phoneme based on a preset rule;
and replacing the corresponding phoneme by at least one candidate replacing phoneme of which the probability is not less than a second preset threshold value to obtain a corresponding second phoneme set.
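The phoneme replacement described above can be sketched as follows; the confusion table (standing in for the preset rule) and the second preset threshold are toy values invented for illustration:

    # Sketch: expand a first phoneme set into a second phoneme set by replacing
    # phonemes with candidate phonemes whose replacement probability is high enough.
    CONFUSABLE = {  # hypothetical preset rule: phoneme -> [(candidate, probability)]
        "n": [("l", 0.6), ("m", 0.2)],
        "ang": [("an", 0.7)],
    }
    SECOND_THRESHOLD = 0.5  # assumed second preset threshold

    def expand_phonemes(first_phoneme_set):
        second_phoneme_set = []
        for ph in first_phoneme_set:
            candidates = [c for c, p in CONFUSABLE.get(ph, []) if p >= SECOND_THRESHOLD]
            second_phoneme_set.append(candidates[0] if candidates else ph)
        return second_phoneme_set

    print(expand_phonemes(["n", "i", "ang"]))  # ['l', 'i', 'an']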
In an optional embodiment of the present application, the obtaining a second speech recognition result of the speech to be recognized based on the context information and the pronunciation feature information includes:
acquiring corresponding fusion information based on the context information and the pronunciation characteristic information;
acquiring a predicted text of the target text unit based on the fusion information;
and replacing the target text unit in the first voice recognition result by the predicted text to obtain a second voice recognition result.
In an optional embodiment of the present application, acquiring corresponding fusion information based on the context information and the pronunciation feature information includes:
and fusing the pronunciation feature information and the context information by using a multi-head cross-attention network, with the pronunciation feature information as the query and the context information as the key and value, to obtain the fusion information.
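A minimal sketch of such a fusion step follows, using PyTorch's generic multi-head attention as a stand-in for the cross-attention network described above; the dimensions, number of heads, and module names are assumptions, not the application's actual model:

    import torch
    import torch.nn as nn

    # Sketch: fuse pronunciation features (query) with masked-context features
    # (key/value) through multi-head cross-attention; the output plays the role
    # of the "fusion information".
    class PronunciationContextFusion(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, pron_feats, context_feats):
            # pron_feats:    (batch, num_phonemes, dim) -> query
            # context_feats: (batch, seq_len, dim)      -> key and value
            fused, _ = self.attn(pron_feats, context_feats, context_feats)
            return fused

    fusion = PronunciationContextFusion()
    out = fusion(torch.randn(1, 6, 256), torch.randn(1, 12, 256))
    print(out.shape)  # torch.Size([1, 6, 256])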
In an optional embodiment of the present application, obtaining a predicted text of a target text unit based on the fusion information includes:
and decoding the fusion information to obtain a predicted text of the target text unit.
In an optional embodiment of the present application, decoding the fusion information to obtain a predicted text of the target text unit includes:
and for the current target text unit, decoding the fusion information based on the predicted text of the previous target text unit to obtain the predicted text of the current target text unit.
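The partial decoding described above can be sketched as a loop that decodes only the target text units, conditioning each one on the predicted text of the previous target unit; decode_step is a hypothetical function standing in for the decoder, and only the control flow is illustrated:

    # Sketch: decode only the masked/target positions, left to right, feeding the
    # predicted text of the previous target unit into the next decoding step.
    def decode_targets(fusion_info_per_target, decode_step):
        """fusion_info_per_target: fusion information, one item per target unit.
        decode_step(fusion_info, previous_text) -> predicted text for this unit."""
        predictions = []
        previous = None  # a real decoder would use a begin-of-sequence marker
        for fusion_info in fusion_info_per_target:
            predicted = decode_step(fusion_info, previous)
            predictions.append(predicted)
            previous = predicted
        return predictions

    # Toy usage: a decode_step that just echoes its inputs.
    print(decode_targets(["f1", "f2"], lambda f, prev: f"pred({f},{prev})"))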
In an optional embodiment of the present application, obtaining a first speech recognition result of a speech to be recognized includes:
acquiring a speech to be recognized comprising at least two languages, and acquiring at least one candidate speech recognition result aiming at each language;
acquiring a word graph corresponding to each language based on at least one candidate voice recognition result;
and searching paths in the word graphs corresponding to the languages, and taking the text sequence corresponding to the optimal path as a first voice recognition result.
In an optional embodiment of the present application, obtaining a word graph corresponding to each language based on at least one candidate speech recognition result includes:
acquiring a starting time interval and an ending time interval of a text unit in at least one candidate voice recognition result, and acquiring a longest common subsequence corresponding to at least one candidate voice recognition result;
and acquiring word graphs corresponding to all languages based on the starting time interval and the ending time interval of the text unit and the longest common subsequence.
In an optional embodiment of the present application, performing a path search in a word graph corresponding to each language includes:
acquiring the sequence of text units based on the starting time interval and the ending time interval of the text units in the word graph corresponding to each language;
performing path search based on the sequence to obtain a first path set;
and, for any text unit in a first path, skipping a preset number of text units after it and connecting it with a subsequent text unit to obtain a second path set.
In an optional embodiment of the present application, obtaining a first speech recognition result of a speech to be recognized includes:
acquiring a voice to be recognized comprising at least two languages;
respectively coding each language to obtain corresponding voice characteristics;
and decoding the voice characteristics respectively corresponding to all the languages and the text information characteristics of the voice to be recognized to obtain a first voice recognition result of the voice to be recognized.
In an optional embodiment of the present application, decoding based on speech features respectively corresponding to each language and text information features of a speech to be recognized to obtain a first speech recognition result of the speech to be recognized, includes:
acquiring first decoding features corresponding to each language by using a multi-head cross-attention network, with the text information features as the query and the voice features corresponding to each language as the key and value;
carrying out linear classification on each first decoding characteristic to obtain a weight coefficient of each first decoding characteristic;
and acquiring a first voice recognition result based on each first decoding characteristic and the corresponding weight coefficient.
In an optional embodiment of the present application, obtaining the first speech recognition result based on each first decoding feature and the corresponding weight coefficient includes:
acquiring second decoding characteristics of the voice to be recognized based on the first decoding characteristics and the corresponding weight coefficients;
obtaining language classification characteristics of the voice to be recognized based on the second decoding characteristics;
and splicing the second decoding features and the language classification features, and acquiring a first voice recognition result based on a splicing result.
In a second aspect, an embodiment of the present application provides a speech recognition method, including:
acquiring to-be-recognized voice containing at least two languages, and acquiring at least one candidate voice recognition result aiming at each language;
acquiring a word graph corresponding to each language based on at least one candidate voice recognition result;
and searching paths in the word graphs corresponding to the languages, and taking the text sequence corresponding to the optimal path as a first voice recognition result.
In an optional embodiment of the present application, obtaining a word graph corresponding to each language based on at least one candidate speech recognition result includes:
acquiring a starting time interval and an ending time interval of a text unit in at least one candidate voice recognition result, and acquiring a longest common subsequence corresponding to at least one candidate voice recognition result;
and acquiring word graphs corresponding to all languages based on the starting time interval and the ending time interval of the text unit and the longest common subsequence.
In an optional embodiment of the present application, performing a path search in a word graph corresponding to each language includes:
acquiring the sequence of text units based on the starting time interval and the ending time interval of the text units in the word graph corresponding to each language;
performing path search based on the sequence to obtain a first path set;
and, for any text unit in a first path, skipping a preset number of text units after it and connecting it with a subsequent text unit to obtain a second path set.
In a third aspect, an embodiment of the present application provides a speech recognition method, including:
acquiring a voice to be recognized comprising at least two languages;
respectively coding each language to obtain corresponding voice characteristics;
and decoding the voice characteristics respectively corresponding to all the languages and the text information characteristics of the voice to be recognized to obtain a first voice recognition result of the voice to be recognized.
In an optional embodiment of the present application, decoding based on speech features respectively corresponding to each language and text information features of a speech to be recognized to obtain a first speech recognition result of the speech to be recognized includes:
acquiring first decoding features corresponding to each language by using a multi-head cross-attention network, with the text information features as the query and the voice features corresponding to each language as the key and value;
carrying out linear classification on each first decoding characteristic to obtain a weight coefficient of each first decoding characteristic;
and acquiring a first voice recognition result based on each first decoding characteristic and the corresponding weight coefficient.
In an optional embodiment of the present application, obtaining the first speech recognition result based on each first decoding feature and the corresponding weight coefficient includes:
acquiring second decoding characteristics of the voice to be recognized based on the first decoding characteristics and the corresponding weight coefficients;
obtaining language classification characteristics of the voice to be recognized based on the second decoding characteristics;
and splicing the second decoding features and the language classification features, and acquiring a first voice recognition result based on a splicing result.
In a fourth aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the first voice recognition result acquisition module is used for acquiring a first voice recognition result of the voice to be recognized;
the information acquisition module is used for acquiring the context information and pronunciation characteristic information of the target text unit in the first voice recognition result;
and the second voice acquisition module is used for acquiring a second voice recognition result of the voice to be recognized based on the context information and the pronunciation characteristic information.
In a fifth aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the candidate voice recognition result module is used for acquiring the voice to be recognized containing at least two languages and acquiring at least one candidate voice recognition result aiming at each language;
the word graph acquisition module is used for acquiring a word graph corresponding to each language based on at least one candidate voice recognition result;
and the path searching module is used for searching paths in the word graphs corresponding to the languages and taking the text sequence corresponding to the optimal path as a first voice recognition result.
In a sixth aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the system comprises a to-be-recognized voice acquisition module, a recognition processing module and a recognition processing module, wherein the to-be-recognized voice acquisition module is used for acquiring to-be-recognized voice containing at least two languages;
the coding module is used for coding each language respectively to obtain corresponding voice characteristics;
and the decoding module is used for decoding the voice characteristics respectively corresponding to all the languages and the text information characteristics of the voice to be recognized to obtain a first voice recognition result of the voice to be recognized.
In a seventh aspect, an embodiment of the present application provides an electronic device, including a memory and a processor;
the memory has a computer program stored therein;
a processor configured to execute a computer program to implement the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
In an eighth aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
The beneficial effects brought by the technical solutions provided by this application are as follows:
The context information and the pronunciation feature information of the target text unit that needs to be corrected in the first voice recognition result of the voice to be recognized are obtained, the corresponding target text unit is corrected by combining the context information and the pronunciation feature information, and the second voice recognition result is obtained after correction. Because the context information and the pronunciation feature information are combined, the correction process can cover more types of errors, and the accuracy of the correction result is higher.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2a is a flow chart illustrating a speech recognition method according to an example of an embodiment of the present application;
FIG. 2b is a diagram illustrating an example of obtaining a first speech recognition result by a bilingual recovery decoder according to an embodiment of the present application;
FIG. 2c is a diagram illustrating obtaining word graphs in an example of the embodiment of the present application;
FIG. 2d is a diagram illustrating a route search based on a word graph in an example according to an embodiment of the present application;
FIG. 2e is a diagram illustrating a route search based on a word graph in an example according to an embodiment of the present application;
FIG. 2f is a schematic flow chart of a speech recognition method according to an example of the embodiment of the present application;
FIG. 2g is a diagram illustrating a first stage training process according to an embodiment of the present application;
FIG. 2h is a diagram illustrating a second stage training process according to an embodiment of the present application;
FIG. 2i is a network diagram of the multi-encoder and language-aware decoder according to an example of an embodiment of the present application;
fig. 3a is a schematic diagram illustrating an example of acquiring fusion information through a mask dual-channel fusion module according to an embodiment of the present application;
FIG. 3b is a diagram illustrating an example of obtaining a pronunciation feature cluster by a pronunciation feature information extraction module according to an embodiment of the present application;
FIG. 3c is a diagram illustrating an example of obtaining pronunciation feature cluster vectors based on a pronunciation prediction model of expert knowledge mining according to an embodiment of the present application;
fig. 3d is a schematic diagram illustrating an example of acquiring fusion information by an information fusion module according to an embodiment of the present application;
fig. 3e is a schematic diagram illustrating an example of decoding the fusion information by a decoding module according to an embodiment of the present application;
FIG. 3f is a diagram illustrating the calculation of a word sequence score in one example of an embodiment of the present application;
FIG. 3g is a schematic flow chart of speech recognition in an example of an embodiment of the present application;
FIG. 3h is a diagram illustrating phoneme-context fusion information in an example of an embodiment of the present application;
FIG. 4 is a schematic diagram of a speech recognition system in an example of an embodiment of the present application;
FIG. 5a is a flowchart illustrating a speech recognition method according to an example of an embodiment of the present application;
FIG. 5b is a diagram illustrating suspicious location detection in an example of an embodiment of the present application;
fig. 6a is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6b is a block diagram of another speech recognition apparatus according to an embodiment of the present application;
fig. 6c is a block diagram of a structure of another speech recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Error correction systems in existing speech recognition technologies can be broadly classified into rule-based, statistics-based, and neural-network-based systems. A rule-based system corrects the speech recognition result by applying rules; a statistics-based system models statistical knowledge to correct likely misrecognitions; a neural-network-based system first builds a neural network, then trains it with collected data, and finally uses the trained network to correct the speech recognition result. Specifically, the rule-based system has limited usage scenarios because speech recognition errors are varied and rules can hardly cover them completely; statistics-based systems are being replaced by neural-network-based systems due to limitations of the data and the technology itself; neural-network-based systems are currently the most widely used, but they still face the problems that speech recognition errors are of many kinds and are hard for a model to cover completely, and that corrections may be inconsistent with the context.
Furthermore, some existing schemes combine text information of a speech recognition result with corresponding word pronunciation information, so that various errors caused by the pronunciation information can be effectively processed, but context information is not fully considered, so that the existing schemes cannot be applied to all languages, and the language types are limited. Meanwhile, in the existing scheme, error sample data needs to be generated as training data of the neural network model, and the generated error sample has a great influence on the final result of the model because the neural network model has strong dependence on the data.
Further, the prior art has the following problems:
1. Quality problem: poor correction effect.
(1) When recognition errors in the test set have not appeared in the training set, the recognition errors can propagate into the result of the correction module, so the robustness of the system is not high; this is very common in practical situations;
(2) the user's pronunciation may have an accent or pronunciation errors that cause recognition errors; the corresponding pronunciation sequences are then also wrong and are likewise propagated to the result of the correction module;
(3) the original decoder generates the correction result from beginning to end, with the risk of modifying an originally correctly recognized word into a wrong one.
2. Speed problem: the correction latency is positively and linearly correlated with the length of the recognition result sequence.
When the recognition result contains 10 to 20 words, the latency spent on correction is about 100 to 200 ms, which is slow and seriously affects the user experience.
3. Multi-lingual speech recognition problems: when the speech to be recognized includes multiple languages, if the speech to be recognized is recognized by using a speech recognition method for a specific language, the obtained speech recognition result may be inaccurate, and further the subsequent correction result may be inaccurate.
In view of the above problems, embodiments of the present application provide a speech recognition scheme, which will be described in detail below.
Fig. 1 is a schematic flowchart of a speech recognition method provided in an embodiment of the present application, and as shown in fig. 1, the method may include:
step S101, a first voice recognition result of the voice to be recognized is obtained.
The first speech recognition result is an initial text sequence obtained by recognizing the speech to be recognized, errors may exist in the initial text sequence, and further error correction is needed to obtain a more accurate text sequence, that is, the second speech recognition result in the application. It will be appreciated that the initial text sequence may be derived from various speech recognition methods known in the art.
Step S102, context information and pronunciation feature information of the target text unit in the first voice recognition result are obtained.
The text unit is a further division of the speech recognition result, and the division modes of different language types may be different; for example, one word may be used as one text unit in English, and one character or one word may be used as one text unit in Chinese. In the embodiment of the present application, the target text unit is a text unit in the first speech recognition result that is considered to possibly contain an error and needs to be corrected. It is understood that the first speech recognition result may include a plurality of target text units.
Specifically, after the initial text sequence corresponding to the first speech recognition result is obtained, each target text unit which needs to be corrected is obtained, and context information and pronunciation feature information corresponding to each text unit are obtained.
And step S103, acquiring a second voice recognition result of the voice to be recognized based on the context information and the pronunciation characteristic information.
Specifically, for each target text unit, based on the context information and pronunciation feature information of the target text unit, the target text unit is modified to obtain a corresponding correct text (hereinafter referred to as a predicted text), and the predicted text is used to replace the corresponding target text unit in the first speech recognition result to obtain a modified text sequence, i.e., a second speech recognition result.
According to the scheme provided by the embodiment of the application, the context information and the pronunciation feature information of the target text unit that needs to be corrected in the first speech recognition result of the speech to be recognized are obtained, the corresponding target text unit is corrected by combining the context information and the pronunciation feature information, and the second speech recognition result is obtained after correction. Because the context information and the pronunciation feature information are combined, the correction process can cover more types of errors, and the accuracy of the correction result is higher.
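To make the overall flow of steps S101 to S103 concrete, the following is a minimal sketch of the correction pipeline; all component functions (recognizer, locate_targets, get_context, get_pronunciation, predict_text) are hypothetical placeholders for the modules described in this application, not actual implementations:

    # Sketch of the overall correction pipeline of Fig. 1 (assumed helper functions).
    def correct_speech_recognition(audio, recognizer, locate_targets,
                                   get_context, get_pronunciation, predict_text):
        # Step S101: first speech recognition result (initial text sequence).
        first_result = recognizer(audio)
        # Step S102: context and pronunciation features of each target text unit.
        targets = locate_targets(first_result)           # e.g. via confusion values
        second_result = list(first_result)
        for pos in targets:
            context = get_context(first_result, pos)     # masked-context channel
            pron = get_pronunciation(first_result[pos])  # pronunciation channel
            # Step S103: predict the corrected text and substitute it in place.
            second_result[pos] = predict_text(context, pron)
        return second_result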
In an optional embodiment of the present application, if the speech to be recognized includes at least two languages, obtaining a first speech recognition result of the speech to be recognized includes:
acquiring at least one candidate voice recognition result aiming at each language;
acquiring a word graph corresponding to each language based on at least one candidate voice recognition result;
and searching paths in the word graphs corresponding to the languages, and taking the text sequence corresponding to the optimal path as a first voice recognition result.
When the speech to be recognized contains multiple languages, if the speech to be recognized is recognized by a speech recognition method aiming at a specific language, the obtained first speech recognition result may be inaccurate, and then the subsequently obtained second speech recognition result is also inaccurate.
Specifically, as shown in fig. 2a, taking a case that the to-be-recognized speech includes two languages as an example, the speech recognition method provided in the embodiment of the present application may specifically include the following steps:
step one, judging whether the speech to be recognized contains two known languages; if so, proceeding to step two, otherwise obtaining the recognition output through the corresponding speech recognizer (i.e., the speech recognition module of language one in the figure) to obtain the corresponding first speech recognition result, and then proceeding to step four;
step two, obtaining candidate speech recognition results output by the two speech recognizers of the known languages (i.e., the speech recognition modules of language one and language two in the figure);
step three, passing the output candidate speech recognition results through a bilingual recovery decoder to obtain a reconstructed recognition result, i.e., the corresponding first speech recognition result; when the speech to be recognized contains multiple languages, the bilingual recovery decoder of this embodiment can recover the bilingual recognition result through a fuzzy-boundary beam search, needs neither accurate frame-level or word-level language detection (which avoids search errors caused by language-detection mistakes) nor accurate timestamp information, and thereby addresses problem 3 above;
step four, applying mask dual-channel fusion to the first speech recognition result to obtain a pronunciation-content-aware vector (i.e., the fusion information below); the mask dual-channel fusion module locates the position of a speech recognition error and fuses the pronunciation channel with the masked context channel. The mask dual-channel fusion module of this embodiment can avoid noise from the erroneous region, prevent the recognition error from propagating, and make the system more robust, addressing problem (1) above;
and step five, finally feeding the fusion information through a partial decoder to obtain the second speech recognition result of the speech to be recognized; the partial decoder decodes only the region with the speech recognition error in a partial decoding manner to generate a correct speech recognition result, which shortens the correction latency and avoids modifying originally correct words into wrong ones, addressing problem (3) and problem 2 above.
Specifically, as shown in fig. 2b, the obtaining of the first speech recognition result by the bilingual restoration decoder in fig. 2a may specifically include the following steps:
step one, inputting the speech into the speech recognizers of the two known languages to respectively obtain the recognition results of the two languages (namely the language 1 speech recognition candidates and the language 2 speech recognition candidates in fig. 2b), wherein the recognition result of each language comprises one or more candidate speech recognition results;
step two, the candidate speech recognition result obtained in the step one is used for constructing word graphs based on interval alignment, namely the candidate speech recognition result corresponding to each language can construct a corresponding word graph, so that two word graphs can be constructed;
and step three, performing a path search on the word graphs by adopting a K-hop beam search, and taking the obtained optimal path as the reconstructed bilingual recognition result, i.e., the first speech recognition result.
The method for selecting the optimal path may be: and calculating the score of the language model for each path obtained by path search by using the cross-language model containing the two languages, and selecting the path with the highest score as the optimal path.
The bilingual recovery decoder recovers the bilingual recognition result by a fuzzy-boundary beam search; it does not require accurate frame-level or word-level language detection, which avoids search errors caused by language-detection mistakes, does not require accurate timestamp information, and can be used in end-to-end speech recognition.
In an optional embodiment of the present application, obtaining a word graph corresponding to each language based on at least one candidate speech recognition result includes:
acquiring a starting time interval and an ending time interval of a text unit in at least one candidate voice recognition result, and acquiring a longest common subsequence corresponding to at least one candidate voice recognition result;
and acquiring word graphs corresponding to all languages based on the starting time interval and the ending time interval of the text unit and the longest common subsequence.
Specifically, the method for constructing the word graph based on interval alignment may include the following steps:
step one, inputting the candidate speech recognition results and obtaining the longest common subsequence by using a longest common subsequence algorithm;
step two, forming a word graph from the candidate list according to the longest common subsequence;
and step three, on the word graph, the start time of the timestamp of each word is a time interval (namely the starting time interval) and the end time is a time interval (namely the ending time interval); these intervals are determined by the time information of each word in the original candidate list and an artificially given time interval threshold.
For example, as shown in fig. 2c, for a piece of speech to be recognized, the speech recognizer provides a plurality of candidate speech recognition results for a certain language (i.e., candidate 1, candidate 2, and candidate 3 in the figure, and other candidates not shown), and the start time and the end time of each word in each candidate speech recognition result are inaccurate timestamp information. According to step one above, the longest common subsequence "Michael Jackson Book Dangerous To A" is obtained. According to step two above, the word graph shown in the lower part of fig. 2c is obtained. According to step three, the timestamp of each word on the word graph is an interval, and the interval is determined by the start time and the end time in each candidate speech recognition result and an artificial threshold. For example, the floating time interval of the word "Jackson" given in fig. 2c is the time interval between the floating word start position (the start time of the starting time interval, which may be referred to as the Buffered Word Start, BWS) and the floating word end position (the end time of the ending time interval, which may be referred to as the Buffered Word End, BWE).
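A simplified sketch of the interval-aligned word graph construction follows; it uses Python's difflib matching blocks as a rough stand-in for the longest-common-subsequence step, and the timestamps and buffer threshold are toy values:

    from difflib import SequenceMatcher

    # Sketch: collect buffered (floating) start/end intervals per word from several
    # timestamped candidates of one language, and extract a common word backbone.
    def buffered_nodes(candidates, buffer=0.1):
        """candidates: list of [(word, start, end), ...]; returns word -> (BWS, BWE)."""
        times = {}
        for cand in candidates:
            for word, start, end in cand:
                entry = times.setdefault(word, {"starts": [], "ends": []})
                entry["starts"].append(start)
                entry["ends"].append(end)
        return {w: ((min(v["starts"]) - buffer, max(v["starts"]) + buffer),  # BWS interval
                    (min(v["ends"]) - buffer, max(v["ends"]) + buffer))      # BWE interval
                for w, v in times.items()}

    def common_word_sequence(a, b):
        """Rough common subsequence (as words) of two candidates."""
        m = SequenceMatcher(a=[w for w, _, _ in a], b=[w for w, _, _ in b])
        return [a[i][0] for blk in m.get_matching_blocks()
                for i in range(blk.a, blk.a + blk.size)]

    cand1 = [("Michael", 0.0, 0.4), ("Jackson", 0.4, 0.9), ("Dangerous", 1.0, 1.6)]
    cand2 = [("Michael", 0.0, 0.5), ("Jackson", 0.5, 1.0), ("Dangerous", 1.1, 1.7)]
    print(common_word_sequence(cand1, cand2))
    print(buffered_nodes([cand1, cand2]))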
As shown in fig. 2d, the upper part is the word graph corresponding to English and the lower part is the word graph corresponding to Korean. Since each word can float within a certain time interval (i.e., each word corresponds to a floating time interval), the search path is more flexible. For example, the path consisting of "Michael Jackson" followed by the Korean words (rendered as an image in fig. 2d) can be searched using the floating time intervals; if the floating time intervals are not used, the search excludes this path.
In an optional embodiment of the present application, performing a path search in a word graph corresponding to each language includes:
acquiring the sequence of text units based on the starting time interval and the ending time interval of the text units in the word graph corresponding to each language;
performing path search based on the sequence to obtain a first path set;
and, for any text unit in a first path, skipping a preset number of text units after it and connecting it with a subsequent text unit to obtain a second path set.
The first path set and the second path set are finally obtained through path search, and the second path set can be understood as supplement and expansion of the first path set.
Specifically, the path search may also be referred to as a K-hop search, and the K-hop search may specifically include the following steps:
step one, after the word graphs of the two languages are obtained, sorting according to the time interval of each word on the word graphs;
step two, determining search paths formed by adjacent nodes, where all adjacent nodes on a path must be adjacent in the word graph; searching in this way yields a plurality of first paths, i.e., the first path set. Whether two nodes are adjacent may be judged as follows: node A and node B are considered adjacently connected if and only if there is no other node C such that i) BWS(C) (the floating word start position of node C) lies between BWS(A) and BWS(B), and ii) BWE(C) (the floating word end position of node C) lies between BWE(A) and BWE(B);
and step three, expanding the search paths obtained in step two: for each word, the search may also skip K words; searching in this way yields a plurality of second paths, i.e., the second path set.
As shown in fig. 2e, on the original beam search path, the operation of expanding the search path by skipping K words enlarges the space of search paths (K = 1 in the figure, i.e., a 1-skip beam search path), making it easier to include the correct search path. For example, without this method the searched paths would not contain the path "Michael - Jackson - [Korean word] - Dangerous - [Korean word]" (the Korean words are rendered as images in the original figure). The value of K may be set empirically before performing the path search; when setting K, the path length needs to be considered to ensure that it is not exceeded after skipping K words.
It should be noted that, the above description of obtaining the first speech recognition result of the speech to be recognized takes the example that the speech to be recognized includes two languages, and it can be understood that the above solutions can be generalized to obtaining the first recognition result of the speech to be recognized that includes more than two languages, and the application is not limited to the above example.
If the speech to be recognized includes at least two languages, recognition errors may occur. A bilingual hybrid recognition system can be designed with neural network models of different architectures: one is based on a recurrent neural network, the other on a transformer model. Encoders of the two languages extract features of their respective languages, weight coefficients of the two features are obtained through linear classification, a hybrid feature output is then obtained through weighted addition, and this hybrid feature enters a decoder for the decoding operation.
The performance of the recurrent-neural-network-based model is poor and cannot meet the requirements of daily applications; the transformer-based model mainly focuses on classification optimization of certain languages in the encoder, does not fully consider the classification information of the text, and is deficient in bilingual discrimination, which causes misrecognition when the language switches within a sentence.
1. For the problem of poor performance of the recurrent neural network model, the embodiment of the application uses an architecture based on multiple encoders and a decoder that fuses language differences (multi-encoder and language-aware decoder), which combines the respective encoders of the two languages with one shared decoder, controlling the total number of parameters while improving the overall performance;
2. for the problem that the transformer model is only optimized in the encoder module, the embodiment of the application can obtain better bilingual discrimination by utilizing the monolingual information obtained by the decoder together with the language classification features, and at the same time obtains better model parameters and decoding performance by using the language classification information of the text for auxiliary training at the output layer.
The network architecture of the decoder based on the multiple encoders and the fusion language difference, provided by the embodiment of the application, can not only make full use of the data resources of the monolingual and the parameters of the encoders to obtain the characteristics of the monolingual languages, but also can meet the requirement of better bilingual discrimination by only one decoder, so that the problem of bilingual mixed recognition in the same sentence can be solved on the basis of not reducing the monolingual recognition performance.
In an optional embodiment of the present application, if the speech to be recognized includes at least two languages, another method of obtaining the first speech recognition result is based on multiple encoders and a language-aware decoder (multi-encoder and language-aware decoder). Taking the example that the speech to be recognized includes two languages, the scheme may specifically include the following steps:
Firstly, feature extraction, speech enhancement, convolution down-sampling, and linear transformation are performed in sequence on the speech to be recognized to obtain the speech feature x (i.e., a low-dimensional speech feature). A Conformer-based encoder is used, which can capture both global and local features of the whole speech sequence. The speech feature x is simultaneously input into the two encoders corresponding to the two languages, and each encoder outputs two high-dimensional speech features key_i and value_i (Equation 2), where the subscript i denotes the language (e.g., Chinese or English). It should be noted that the speech enhancement step exists only in the training stage. In practical applications, the encoder may include N serially arranged encoder modules as shown in the figure, where N is an integer greater than or equal to 1.
For example, the two encoder modules on the left side of fig. 2f are respectively responsible for encoding the input features of their own languages, so the two encoders are pre-trained with a large amount of data of their respective languages, and their parameters are then transferred to the multi-head attention modules of the multi-encoder-and-decoder system, which makes the multi-encoder system converge faster and optimizes performance better. After computation by the multi-head self-attention and convolution modules sandwiched between two feed-forward networks, the final high-dimensional output vectors are added and enter the CTC (Connectionist Temporal Classification) loss computation; meanwhile, the two high-dimensional vectors key_i and value_i are passed to the multi-head attention computation at each decoder layer in the subsequent steps (Equation 3).
Step two, in the computation of the decoder, in each module the query_i is obtained by layer normalization (Equation 1) of the output s of the standard multi-head self-attention layer (i.e., the text information feature), where i is the language label (for example, i = 1 denotes Chinese and i = 2 denotes English). Combining the outputs key_i and value_i of the respective encoders from step one, a multi-head attention computation (MHA_i for short) is performed and connected with the residual of the input s to obtain the output r_i (Equation 3). r_1 and r_2 then undergo linear classification and a normalized exponential (softmax) operation (Equation 4), where r_1 and r_2 are the first decoding features corresponding to the two languages, w_1 and w_2 are the weights corresponding to r_1 and r_2, and bias is the bias term; this yields the two language-dependent weight coefficients α_1 and α_2. The weighted sum with these two coefficients (Equation 5) gives z (i.e., the second decoding feature), which enters the layer normalization and feed-forward network modules. Obtaining the weight coefficients through this unsupervised learning mechanism, instead of specifying hyper-parameters in advance, allows the system to handle switching between languages within a sentence more flexibly without affecting the recognition of each single language. It should be noted that, in practical applications, the decoder (also called a decoder with fused language differences) may include M serially arranged decoder modules as shown in the figure, where M is an integer greater than or equal to 1.
query_i = LayerNorm_i(s)  (1)
key_i = value_i = Encoder_i(x)  (2)
r_i = s + MHA_i(query_i, key_i, value_i)  (3)
α_1, α_2 = softmax(r_1*w_1 + r_2*w_2 + bias)  (4)
z = α_1*r_1 + α_2*r_2  (5)
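Equations (1) to (5) can be sketched in PyTorch roughly as follows; this is a toy, single-block illustration in which the layer sizes, the placement of the layer norms, and the shape of the weighting layer are assumptions rather than the application's exact architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LanguageAwareFusion(nn.Module):
        """Toy version of Equations (1)-(5): per-language cross-attention over the
        encoder outputs, then a learned softmax weighting of the two branches."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
            self.mha = nn.ModuleList(
                [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(2)])
            self.weight_proj = nn.Linear(2 * dim, 2)  # produces alpha_1, alpha_2

        def forward(self, s, enc_outputs):
            # s: (B, T, D) decoder self-attention output (text information feature)
            # enc_outputs: [key_1/value_1, key_2/value_2], each (B, S_i, D)
            r = []
            for i, kv in enumerate(enc_outputs):
                query = self.norm[i](s)                        # Eq. (1)
                attn, _ = self.mha[i](query, kv, kv)           # Eq. (2)-(3), MHA_i
                r.append(s + attn)                             # Eq. (3) residual
            alphas = F.softmax(self.weight_proj(torch.cat(r, dim=-1)), dim=-1)  # Eq. (4)
            z = alphas[..., :1] * r[0] + alphas[..., 1:] * r[1]  # Eq. (5)
            return z

    block = LanguageAwareFusion()
    z = block(torch.randn(2, 5, 256),
              [torch.randn(2, 30, 256), torch.randn(2, 40, 256)])
    print(z.shape)  # torch.Size([2, 5, 256])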
FIG. 2i is a schematic diagram of a multi-encoder and language-aware decoder (multi-encoder and language-aware decoder).
Step three, in the training phase, the decoder output layer has two losses: one is the normal text classification loss, obtained using linear layer_1 and softmax_1 (normalized exponential function _1) as shown in fig. 2g, and the other is the language classification loss, obtained using linear layer_2 and softmax_2 (normalized exponential function _2). The purpose of adding the language loss is to exploit the natural classification ability of the text, so that it assists the model in obtaining better classification performance during decoding. Meanwhile, in order to compute the language loss, the input text sequence of the decoder needs to be converted into a language sequence. In the first stage of training, the loss and the parameter updates are computed using Equation (6), where L_MOL is the total loss, L_CTC is the CTC loss, L_Attention is the attention loss, L_lid is the language classification loss, and λ is an adjustable parameter taking values in the interval [0, 1].
L_MOL = λ*L_CTC + 0.8*(1-λ)*L_Attention + 0.2*(1-λ)*L_lid  (6)
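The first-stage objective of Equation (6) is simply a weighted sum of the three losses; a one-line sketch with an assumed value of λ:

    def total_loss(l_ctc, l_attention, l_lid, lam=0.3):  # lam in [0, 1], value assumed
        """Equation (6): L_MOL = λ*L_CTC + 0.8*(1-λ)*L_Attention + 0.2*(1-λ)*L_lid."""
        return lam * l_ctc + 0.8 * (1 - lam) * l_attention + 0.2 * (1 - lam) * l_lid

    print(total_loss(2.0, 1.0, 0.5, lam=0.3))  # roughly 1.23 (up to floating-point rounding)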
Meanwhile, in order to allow the language classification linear layer_2 to still participate in the computation during decoding, and given that the dimension of the decoder output differs greatly from that of the language classification output, a dimension-extension linear layer_3 is designed and added, as shown in fig. 2h. To match the dimension of the decoder output, the dimension of linear layer_3 is set to the square root of the dimension of the decoder output; the decoder output and the output of linear layer_3 (i.e., the language classification feature) are then spliced and fed into the last linear layer_1 and the normalized exponential function to obtain the decoded text output (i.e., the first speech recognition result). The parameters of the added linear layer_3 are first initialized arbitrarily and are then learned in the second training (fine-tuning) stage after linear layer_2 has converged. Fig. 2h is also the framework used when the model performs decoding.
It should be noted that, the above description of obtaining the first speech recognition result of the speech to be recognized takes the example that the speech to be recognized includes two languages, and it can be understood that the above solutions can be generalized to obtaining the first recognition result of the speech to be recognized that includes more than two languages, and the application is not limited to the above example.
In an optional embodiment of the present application, the method may further comprise:
acquiring a confusion value of a text unit in a first voice recognition result;
and determining the target text unit based on the confusion value, wherein the target text unit can be determined based on the first text unit of which the confusion value is not less than the first preset threshold value.
Specifically, while the first speech recognition result of the speech to be recognized is obtained, the confusion value of each text unit in the initial text sequence can also be obtained. The larger the confusion value corresponding to the text unit is, the lower the certainty of the speech recognition system is, i.e. the higher the error rate of the text unit is. Therefore, the text units in the first speech recognition result whose confusion value is greater than or equal to the first preset threshold value may be referred to as first text units, and the target text unit may be determined based on these first text units.
Specifically, in the process of obtaining the first speech recognition result, each text unit corresponds to a plurality of initial predicted texts, each initial predicted text corresponds to a probability value, and the initial predicted text with the largest probability value can be used as the final prediction result (i.e., the final text unit corresponding to the first speech recognition result). And the confusion value of each text unit in the first speech recognition result is obtained based on the probability value of the initial predicted text, and the calculation formula can be as follows:
loss_i = -log(p_j)
where loss_i denotes the confusion value of the i-th text unit, and p_j denotes the probability value of the j-th initial predicted text corresponding to the i-th text unit, this probability being the maximum among the probability values of the corresponding initial predicted texts.
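A minimal sketch of the confusion-value computation; the per-unit probabilities below are toy numbers, while in the real system p_j comes from the recognizer's output distribution:

    import math

    # Sketch: the confusion value of each text unit is the negative log of the
    # highest probability among its initial predicted texts: loss_i = -log(p_j).
    def confusion_values(unit_probabilities):
        """unit_probabilities: for each text unit, a list of initial-prediction probs."""
        return [-math.log(max(probs)) for probs in unit_probabilities]

    probs = [[0.95, 0.03], [0.55, 0.40], [0.98, 0.01]]
    print([round(v, 3) for v in confusion_values(probs)])
    # [0.051, 0.598, 0.02]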
Further, determining a target text element based on the confusion value includes:
determining the first text unit as a target text unit; or alternatively,
and acquiring a corresponding second text unit based on the first text unit and at least one text unit before and/or after the first text unit, and determining the second text unit as a target text unit.
Specifically, the first text unit may be used directly as the target text unit. On the other hand, when assigning confusion values to the text units during speech recognition, an erroneous text unit may sometimes be assigned a relatively small confusion value, in particular when it is immediately before or after a first text unit; its confusion value may then be smaller than the first preset threshold, so that it is judged as not needing correction. In this case, not only does the erroneous preceding or following text unit remain uncorrected, but the correction result (i.e. the predicted text) of the adjacent first text unit also becomes inaccurate, and the second speech recognition result obtained by correction is therefore inaccurate. In this embodiment, if a first text unit is preceded or followed by such a text unit whose confusion value is smaller than the first preset threshold, then in addition to marking the first text unit as a target text unit, the preceding or following text unit may also be marked, and the preceding or following text unit and the first text unit are merged into a second text unit; the target text units contained in the second text unit are then treated as a whole, that is, the second text unit is taken as one target text unit for subsequent correction. It is understood that only the text units before the first text unit, only the text units after it, or the text units both before and after it may be marked, and the number of marked text units may be determined according to actual needs.
Specifically, for the first speech recognition result, assuming that the number of the included text units is L, the process of acquiring the target text unit may be understood as a process of labeling the target text in the first speech recognition result, and may specifically include:
(1) firstly, sorting the confusion values;
(2) marking the text units corresponding to the first N maximum confusion values, wherein N can be determined by the following formula:
N = f_roundup(L / l)

wherein l is a predetermined value and f_roundup() indicates rounding up the value in parentheses, so the size of N can be controlled by choosing different values of l; for example, when l is taken as 5, N is L/5 rounded up;
(3) setting a first preset threshold value theta, and not considering the text units with the confusion values smaller than theta, namely, the text units with the confusion values not smaller than theta in the first N text units are the first text units;
(4) marking each first text unit together with the w_bf text units before it and the w_bh text units after it, where w_bf ≥ 0 and w_bh ≥ 0; the consecutive marked positions are then regarded as one mark group (i.e. a second text unit). For example, for a completed marked text sequence [A0, B1, C1, D0, E1, F1, G0], where 1 indicates a marked position, BC is one mark group and EF is another mark group; that is, BC and EF are used as the target text units, and the corresponding corrected texts are obtained in the subsequent schemes.
For example, a certain first speech recognition result is "i was about to have my first dry score ever", where the confusion values of the words "about", "to", "dry", "score" are 1.44, 0.11, 1.66 and 1.27, respectively. Comparing these confusion values with the first preset threshold, "about" and "dry" are the first text units. In the embodiment of the present application, "about" and "dry" may be used directly as the target text units; alternatively, the "to" after "about" and the text unit after "dry" may also be marked, so that two second text units, "about to" and "dry gram", are obtained and used as the target text units, and the predicted text corresponding to the latter in the subsequent correction should be "drag race". It is understood that, in this example, w_bf is set to 0 and w_bh is set to 1.
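A minimal Python sketch of the marking procedure in steps (1)-(4) above, with l, θ, w_bf and w_bh taken from this example; the helper name and the grouping of adjacent marks into second text units are illustrative:

```python
import math

def mark_target_units(tokens, confusion, l=5, theta=1.30, w_bf=0, w_bh=1):
    """Keep the top-N most confusing units that exceed the threshold theta, then
    extend each mark by w_bf preceding and w_bh following units to form mark groups."""
    L = len(tokens)
    N = math.ceil(L / l)                                     # N = f_roundup(L / l)
    top = sorted(range(L), key=lambda i: confusion[i], reverse=True)[:N]
    marked = set()
    for i in top:
        if confusion[i] >= theta:                            # first text unit
            for j in range(max(0, i - w_bf), min(L, i + w_bh + 1)):
                marked.add(j)
    groups, group = [], []                                   # contiguous marks form one mark group
    for i in range(L):
        if i in marked:
            group.append(i)
        elif group:
            groups.append(group); group = []
    if group:
        groups.append(group)
    return groups

tokens = "i was about to have my first dry score ever".split()
conf = [0.27, 0.16, 1.44, 0.11, 0.52, 0.12, 0.34, 1.66, 1.27, 0.71]
print(mark_target_units(tokens, conf))   # [[2, 3], [7, 8]] -> "about to", "dry score"
```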
In an optional embodiment of the present application, the obtaining context information of the target text unit in the first speech recognition result includes:
replacing the target text unit in the first voice recognition result with a preset mask to obtain a corresponding input text sequence;
and acquiring context information corresponding to the target text unit based on the input text sequence.
Specifically, first, replacing a target text unit in the first speech recognition result with a preset mask to obtain an input text sequence. Then, the input text sequence is input into a context extraction model, and context information of each mask, that is, context information of each target text unit, is output.
The context extraction model can be any one of the existing language pre-training models, or a new model which is combined by using different neural networks and can achieve the purpose of extracting context information.
For example, suppose the context extraction model is a BERT model. The target text units are replaced with the preset mask pre-trained in the BERT model, the input text sequence obtained after replacement is then input into the BERT model, and a context information sequence with the same length as the input text sequence is obtained. Context information corresponding to each target text unit is then extracted from the context information sequence according to the position of the target text unit in the input text sequence; if other text units separate two target text units, a separator is added between the context information of the two target text units for subsequent matching and fusion with the corresponding pronunciation feature information. Further, assume that the information vector set of the i-th mark group (i.e. target text unit) is denoted C_i = (c_i0, c_i1, …); after adding the separator information vector s, the context information output by the context extraction can be written as C = f_concat(C_0, s, C_1, s, …, C_m), where m is the number of target text units and f_concat() denotes splicing the vectors in parentheses along the first dimension.
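A possible sketch of this masking and context extraction using a publicly available BERT checkpoint; the checkpoint name is an assumption, and the sketch simplifies by assuming each text unit maps to a single wordpiece:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")

def extract_context(tokens, mark_groups):
    """Replace target text units with [MASK], run the context extraction model,
    and gather the hidden states at the masked positions as context information."""
    masked = list(tokens)
    for group in mark_groups:
        for i in group:
            masked[i] = tokenizer.mask_token                      # "[MASK]"
    enc = tokenizer(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]                 # (seq_len, hidden)
    # positions shift by 1 because the tokenizer prepends a [CLS] token;
    # word-to-wordpiece alignment is assumed to be one-to-one here
    return [hidden[[i + 1 for i in group]] for group in mark_groups]
```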
In the model training stage, for the text sequence in the training data, the following labeling method may be adopted to obtain the target text unit:
(1) marking a text unit at any position in the text sequence with a first probability (e.g., 10%);
(2) when any position in the text sequence is marked, marking the next position with a second probability (such as 50%); for the next position, if it is successfully marked, continue marking its next position with a third probability (e.g., 50%), looping until the next position is not marked or the text sequence ends;
(3) discarding text sequences without any marks; and discarding, with a fourth probability (e.g., 80%), text sequences in which the number of marked positions exceeds a preset proportion of the total number of positions, thereby obtaining text sequences marked with target text units for training.
The first probability, the second probability, the third probability, the fourth probability and the preset proportion can be set according to actual requirements, and the embodiment of the application is not limited.
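A Python sketch of this training-time labelling procedure under the example probabilities given above; the function name and return convention are illustrative:

```python
import random

def label_training_sequence(length, p1=0.10, p2=0.50, p3=0.50, p4=0.80, max_ratio=0.30):
    """Start a mark at any position with probability p1, extend it to following
    positions with probabilities p2/p3, then discard unusable sequences."""
    marks = [0] * length
    for i in range(length):
        if random.random() < p1:
            marks[i] = 1
            j = i + 1
            # keep extending with the second/third probability until a miss or the end
            while j < length and random.random() < (p2 if j == i + 1 else p3):
                marks[j] = 1
                j += 1
    if sum(marks) == 0:
        return None                                    # discard: nothing labelled
    if sum(marks) / length > max_ratio and random.random() < p4:
        return None                                    # discard: too many labels
    return marks
```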
Finally, with the above marking method, the obtained marked positions (i.e. the target text units) account for a suitable preset proportion (for example, about 30%) of the total length of the text sequence. Considering that a text unit with a longer pronunciation may be incorrectly recognized as a plurality of text units with shorter pronunciations (for example, "housetop" is recognized as "how step"), or that a plurality of text units with shorter pronunciations may be incorrectly recognized as one text unit with a longer pronunciation, a preset mask may be randomly discarded or added in each mark group with a certain probability (for example, 50%) during the extraction of the context feature information, so as to ensure that the output length is not limited by the input length.
In an optional embodiment of the present application, the obtaining pronunciation feature information of the target text unit in the first speech recognition result includes:
acquiring a first phoneme set of a target text unit;
and acquiring pronunciation characteristic information of the target text unit based on the first phoneme set.
Specifically, the phoneme of each target text unit in the first speech recognition result may be extracted, generally, one text unit may be split into a plurality of phonemes, and the plurality of phonemes are arranged in a pronunciation sequence, where a set of the phonemes is referred to as a first phoneme set, and pronunciation feature information of each target text unit may be obtained based on the first phoneme set of the target text unit.
Wherein, different extraction tools can be used to extract phonemes from the text unit according to different languages.
Further, acquiring pronunciation feature information of the target text unit based on the first phoneme set, including:
using the phonemes in the first phoneme set as pronunciation characteristic information of the target text unit; or alternatively,
and deleting or replacing at least one phoneme in the first phoneme set to obtain a second phoneme set, and taking the phoneme in the second phoneme set as pronunciation characteristic information of the target text unit.
Specifically, each phoneme in the first phoneme set of each target text unit may be used as the pronunciation feature information of that target text unit. On the other hand, in order to account for pronunciation errors of different kinds, prior pronunciation knowledge, error statistics and similar methods may be used, during the acquisition of the pronunciation feature information, to apply certain processing to the first phoneme set of the target text unit (for example, randomly replacing a phoneme in the set with a phoneme of similar pronunciation, or randomly deleting a phoneme; deleting a phoneme can also be understood as replacing it with a null phoneme). This blurs the pronunciation of the word while keeping the main pronunciation features of the target text unit, so that a corresponding second phoneme set is obtained, and each phoneme in the second phoneme set is used as the pronunciation feature information of the corresponding target text unit.
For example, taking English as an example, the phonemes of the marked word (i.e. the target text unit) are first looked up using the English word phoneme lookup tool G2P (Grapheme-to-Phoneme) to obtain the corresponding first phoneme set. Then, pronunciation knowledge, error statistics and similar methods are used to add noise to the word phonemes, blurring the pronunciation of the word while keeping its main pronunciation features, to obtain the corresponding second phoneme set.
Specifically, a phoneme information vector matrix E_p is first established, and each phoneme is represented by an information vector. Assume also that the set of information vectors of the phonemes in the i-th mark group (i.e. target text unit) is denoted P_i = (p_i0, p_i1, …) (i.e. the first phoneme set); after adding the separator information vector s, the pronunciation feature information can be written as P = f_concat(P_0, s, P_1, s, …, P_m), where m is the number of target text units. Then, relative position information may be added to the pronunciation feature information using sine and cosine position coding. The pronunciation feature information with the position information added is used as the pronunciation feature extraction result.
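A hedged sketch of the pronunciation feature extraction, assuming the g2p_en package as the English grapheme-to-phoneme tool and an ordinary embedding table plus sinusoidal position coding; the identifiers are illustrative:

```python
import torch
import torch.nn as nn
from g2p_en import G2p    # English grapheme-to-phoneme tool (assumed)

g2p = G2p()

def pronunciation_features(target_units, phoneme_embed: nn.Embedding, phoneme2id, sep_id=0):
    """Build the pronunciation feature sequence P = f_concat(P_0, s, P_1, ..., P_m)
    and add sinusoidal position information."""
    ids = []
    for k, unit in enumerate(target_units):
        if k > 0:
            ids.append(sep_id)                               # separator between mark groups
        for ph in g2p(unit):
            ph = ph.strip().rstrip("012")                    # drop spaces and stress digits
            if ph:
                ids.append(phoneme2id[ph])
    feats = phoneme_embed(torch.tensor(ids))                 # (len, d)
    d = feats.size(-1)
    pos = torch.arange(len(ids)).unsqueeze(1)
    i = torch.arange(d).unsqueeze(0)
    angle = pos / torch.pow(10000.0, (2 * (i // 2)) / d)
    pe = torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))
    return feats + pe                                        # pronunciation feature information
```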
Further, in order that the scheme of the present application does not depend too much on pronunciation features, the pronunciation feature information may be subjected to noise addition in the training phase so as to increase the utilization of the context feature information. First, a phoneme similarity matrix S is established, where S_ij represents the similarity between the i-th and j-th phonemes, with S_ii = 1 and S_ij = S_ji. Then, a transition probability matrix T of the phonemes is calculated from the similarity matrix:

[equation image in the original: the transition probability matrix T is computed from the similarity matrix S]

where |V_p| is the total number of phonemes.
After the transition probability matrix T is obtained, the phonemes of each tag group may be subjected to noise addition by using at least one of the following schemes, that is, the phonemes in the first phoneme set of each target text unit are subjected to noise addition to obtain a second phoneme set:
(1) the phonemes in the first set of phonemes are left unchanged with a fifth probability (e.g., 50%);
(2) adding any random phoneme at any position with a sixth probability (such as 0.1%);
(3) discarding any positional phoneme with a seventh probability (e.g., 0.1%);
(4) for any phoneme, it is kept unchanged with an eighth probability (e.g., 50%), otherwise it is changed to any phoneme using the transition probability matrix T.
The fifth probability, the sixth probability, the seventh probability and the eighth probability can be set according to actual requirements, and the embodiment of the application is not limited.
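A Python sketch of this noise-adding procedure for one mark group, assuming the transition probability matrix T has already been computed; the probability values follow the examples above and the function name is illustrative:

```python
import random

def add_phoneme_noise(phonemes, T, phoneme_list,
                      p_keep_all=0.5, p_insert=0.001, p_drop=0.001, p_keep_one=0.5):
    """Noise the first phoneme set of one mark group using the transition matrix T
    (row i gives the replacement probabilities for phoneme i)."""
    if random.random() < p_keep_all:
        return list(phonemes)                       # (1) keep the whole set unchanged
    out = []
    for ph in phonemes:
        if random.random() < p_insert:              # (2) insert a random phoneme
            out.append(random.choice(phoneme_list))
        if random.random() < p_drop:                # (3) drop this phoneme
            continue
        if random.random() < p_keep_one:            # (4) keep the phoneme ...
            out.append(ph)
        else:                                       # ... or replace it using T
            i = phoneme_list.index(ph)
            out.append(random.choices(phoneme_list, weights=T[i])[0])
    return out
```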
In an optional embodiment of the present application, the obtaining a second speech recognition result of the speech to be recognized based on the context information and the pronunciation feature information includes:
acquiring corresponding fusion information based on the context information and the pronunciation characteristic information;
acquiring a predicted text of the target text unit based on the fusion information;
and replacing the target text unit in the first voice recognition result by the predicted text to obtain a second voice recognition result.
The context information and the pronunciation feature information of each target text unit can be fused using an information fusion technique; specifically, the information fusion can be based on one or a combination of a recurrent neural network, a convolutional neural network, an attention mechanism network, and the like.
Specifically, after the context information and the pronunciation feature information of each target text unit are fused, the fused information is decoded to obtain a predicted text corresponding to each target text unit, and the predicted text is the correct text after the target text unit is corrected. And then, replacing the corresponding target text unit in the first voice recognition result by using the corrected predicted text to obtain a corrected text sequence, namely a second voice recognition result.
In an optional embodiment of the present application, acquiring corresponding fusion information based on the context information and the pronunciation feature information includes:
and fusing the pronunciation characteristic information and the context information by using a multi-head mutual attention mechanism network and taking the pronunciation characteristic information as a request and the context information as a key value to obtain fused information.
Specifically, the context information and the pronunciation feature information of each target text unit can be fused using an attention mechanism network.
Specifically, a multi-head mutual attention mechanism network is used, with the pronunciation feature information P as the request (query) and the context information C as the key and value, to fuse the two parts of information and obtain the fusion information. The fusion information F is calculated as follows:

F = softmax( (P·W_Q)(C·W_K)^T / √d ) · (C·W_V)

where d is the information feature dimension and W_Q, W_K, W_V are trainable model parameters.
Then, a feedforward neural network may be added after the multi-head mutual attention mechanism network to further integrate the fusion information, and a residual mechanism and a normalization mechanism are added to the output of each network.
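A PyTorch sketch of such a fusion block, using the built-in multi-head attention layer with the pronunciation features as the query and the context information as key and value; dimensions and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class PronunciationContextFusion(nn.Module):
    """Multi-head cross attention (pronunciation as query, context as key/value)
    followed by a feed-forward network, with residual + layer normalization."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, pron, ctx):                    # pron: (B, Lp, d), ctx: (B, Lc, d)
        att, _ = self.attn(query=pron, key=ctx, value=ctx)
        x = self.norm1(pron + att)                   # residual + normalization
        return self.norm2(x + self.ffn(x))           # fusion information F
```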
In an optional embodiment of the present application, obtaining a predicted text of a target text unit based on the fusion information includes:
and decoding the fusion information to obtain a predicted text of the target text unit.
Specifically, decoding the fusion information to obtain the predicted text of the target text unit includes:
and for the current target text unit, decoding the fusion information based on the predicted text of the previous target text unit to obtain the predicted text of the current target text unit.
Specifically, the decoding of the fusion information may adopt a self-loop decoding mode, in which the input of each loop is the output of the previous loop (i.e. the predicted text of the previous target text unit) together with the fusion information, and the output is the predicted text of the current target text unit. In particular, the whole decoding process includes multiple decoding loops until a stop-decoding condition is met, where the input of each decoding loop includes the output of the previous decoding loop. For example, suppose the starting decoding input is <s> and the stop-decoding condition is <\s>: the input of the first decoding loop is <s>; assuming the output of the first decoding loop is A, the input of the second decoding loop contains A and its output is B; the input of the third decoding loop contains B and its output is C; and so on, until the n-th output is <\s>, at which point decoding stops. The final decoding result is ABC…. The decoding may adopt one or a combination of a recurrent neural network, a convolutional neural network, an attention mechanism network, and the like.
For example, the present application may use the decoder of a Transformer as the decoding module. In the self-loop decoding, only the predicted text corresponding to the marked positions (i.e. the positions where the target text units are located) is predicted. As with the separation used in the context extraction process, a preset separator may be added between two non-adjacent target text units.
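A minimal sketch of the self-loop decoding mode described above, assuming a hypothetical decoder callable that maps the tokens decoded so far plus the fusion information to next-token logits:

```python
import torch

def self_loop_decode(decoder, fusion, bos_id, eos_id, max_len=32):
    """Each cycle feeds the previously decoded tokens and the fusion information
    back into the decoder and stops when the stop token is produced."""
    tokens = [bos_id]                                    # starting decoding input <s>
    for _ in range(max_len):
        logits = decoder(torch.tensor([tokens]), fusion) # (1, len, vocab) - hypothetical decoder
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:                            # stop-decoding condition <\s>
            break
        tokens.append(next_id)
    return tokens[1:]                                    # predicted text ids for the target units
```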
In an alternative embodiment of the present application, the pronunciation feature information may be a pronunciation feature cluster, i.e. containing multiple sets of pronunciation features, as shown in fig. 3a, it is assumed that the recognition result information X of the ASR system is "how (-2.3) step (-2.1) is (-0.4) a (-0.6) compound (-0.7) word (-0.9)", where a word is a recognition result, a number in parentheses is confusion value information corresponding to each word, and "how step" is an erroneous recognition result. Then, the acquiring of the fusion information may specifically include the following steps:
step one, a suspicious position detection module is used for detecting a suspicious position (namely a target text unit) in a first identification result;
step two (a), using the pronunciation feature extraction module, extracting the pronunciation information of the suspicious position according to the grapheme sequence X_T of the suspicious position, establishing a pronunciation feature cluster, and vectorizing the pronunciation feature cluster into a feature cluster vector;

step two (b), replacing the recognition result at the suspicious position with the mask "<MASK>"; the result after substitution is "<MASK> <MASK> is a compound word". Using the context information extraction module, the context information of the suspicious position is extracted from this masked sequence to obtain the context mask information;
and step three, fusing the context mask information and the pronunciation feature cluster information by using an information fusion module to obtain a fused phoneme-context fusion vector (namely a vector paying attention to pronunciation content, namely fusion information).
The mask dual-channel mechanism provided by the embodiment of the application replaces the words at the suspicious positions with the masks, so that error propagation is avoided in the context channel; in addition, the mask dual-channel fusion module can expand a phoneme sequence into a plurality of similar phoneme sequences, which is helpful for enriching the search space and alleviating recognition errors and error propagation problems caused by accent or pronunciation errors.
Specifically, as shown in fig. 3b, the acquiring of the pronunciation feature cluster vector by the pronunciation feature information extraction module in fig. 3a may specifically include the following steps:
Step one: the grapheme sequence X_T of the suspicious position is converted from "how step" into the phoneme sequence I_X = "HH AW S T EH P" by the grapheme-to-phoneme module;
Step two: the phoneme sequence I_X is expanded through a pronunciation prediction model based on expert knowledge mining to obtain a feature cluster of phoneme sequences, which includes the correct pronunciation feature of the suspicious position and other phoneme sequence features similar in pronunciation to the suspicious position. The pronunciation prediction model based on expert knowledge mining can calculate the probability of each phoneme in the sampling space and expand the phoneme sequence into a plurality of similar phoneme sequences, which creates more opportunities to find the correct recognition result and can enhance the robustness of the system; in addition, the expert knowledge can be used to dynamically prune unreasonable phonemes from a path, so that the generated phonemes are more reasonable;
step three: and vectorizing the feature cluster, and converting the feature cluster into a feature cluster vector.
As shown in fig. 3c, a possible pronunciation prediction model based on expert knowledge mining may specifically include the following steps:
Step one, for the phoneme sequence I_X, the phoneme sequence is input into the phoneme prediction model, and a replacement phoneme is obtained for each phoneme position in the phoneme sequence, i.e. the state space of each phoneme in the phoneme sequence is obtained. Assume that the current phoneme is I_i and the previously selected phoneme is Ô_(i−1); the probability distribution of similar phonemes is calculated as P(O_i | I_i, Ô_(i−1)), i.e. O_i serves as the state space of I_i, where O_i represents a set of phonemes. (For example, when Ô_(i−1) is "HH" and the probability P(AW | AW, HH) that the current phoneme I_i is replaced by "AW" is 0.4, the model will choose "AW" as the phoneme at position i with a probability of 40%, i.e. the probability that the model takes "AW" as the replacement phoneme for position i is 0.4.) In addition, if a probability distribution path violates expert knowledge (e.g. P(AE | S, OW) = 0.2, i.e. "AE" has a probability of 0.2 of being the replacement phoneme for S, which is less than the second preset threshold), the path is ignored (i.e. the corresponding phoneme is not replaced). Expert knowledge may be a rule predefined by a linguistic expert.

Step two, according to the state space probability P(O_i | I_i, Ô_(i−1)), sampling is performed to obtain the phoneme Ô_i at the current position (for example, for "OW", if P(OW | AW, HH) = 0.4, then the probability of "OW" being the phoneme at position i is 40%);

Step three, for position i = 0, P(O_0 | I_0) is used to sample Ô_0;

Step four, the sampling is repeated K times to obtain K feature clusters, where K is related to the input phoneme length input_length and may be taken as 5 × input_length.
In the pronunciation prediction model based on expert knowledge mining according to the embodiment of the present application, based on previously sampled phonemes (which may also be referred to as sampled phonemes) and a current phoneme, a replacement probability (a probability of replacing the current phoneme) corresponding to each phoneme is calculated, unreasonable paths are pruned according to expert knowledge (the phonemes are filtered/screened according to the replacement probability), and similar phonemes are extracted according to the replacement probability after pruning (the phonemes replacing the current phoneme are selected according to the replacement probability).
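A Python sketch of this sampling procedure, assuming a hypothetical prob(candidate, current, previous) function that returns the replacement probability and an illustrative expert-knowledge threshold:

```python
import random

def sample_similar_sequence(phonemes, prob, phoneme_set, expert_threshold=0.3):
    """One sampling pass: prune candidates whose replacement probability falls below
    the expert-knowledge threshold, then draw one replacement phoneme per position."""
    sampled = []
    for i, cur in enumerate(phonemes):
        prev = sampled[i - 1] if i > 0 else None
        space = {c: prob(c, cur, prev) for c in phoneme_set}               # state space O_i
        space = {c: p for c, p in space.items() if p >= expert_threshold}  # prune unreasonable paths
        if not space:
            sampled.append(cur)                       # nothing plausible: keep the original phoneme
        else:
            cands, weights = zip(*space.items())
            sampled.append(random.choices(cands, weights=weights)[0])
    return sampled

def build_feature_cluster(phonemes, prob, phoneme_set, factor=5):
    """Repeat the sampling K = 5 * input_length times to obtain the feature cluster."""
    return [sample_similar_sequence(phonemes, prob, phoneme_set)
            for _ in range(factor * len(phonemes))]
```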
Specifically, as shown in fig. 3d, the acquiring of the fusion information by the information fusion module in fig. 3a may specifically include the following steps:
step one, regarding a group of phoneme characteristics in the phoneme characteristic cluster as a request, regarding mask context information as a key value, and performing Attention (Attention) calculation operation to obtain context Attention information (weighted context information in a corresponding graph);
and step two, adding the request and the attention information to obtain final phoneme-context fusion information. The fusion information comprises the pronunciation information and the context information of the mask;
and step three, repeating the steps for each group of phoneme characteristics in the phoneme characteristic cluster to finally obtain a phoneme-context fusion information cluster.
The information fusion module can find the most important context information according to the corresponding pronunciation, which is beneficial to correcting the wrongly recognized words based on the truly relevant words, and the correction accuracy is improved.
As shown in fig. 3h, the information fusion module can find important context information in words other than the location of the recognition error, fig. 3h shows an example of phoneme-context fusion information, and each cell represents the importance of the context information of the abscissa to the phoneme information of the ordinate, for example, the cell (HH, compound) represents the importance of "compound" to "HH".
Specifically, as shown in fig. 3e, the processing of the fusion information by the partial decoding module in fig. 3a may specifically include the following steps:
step one, inputting one of the fusion information in the fusion information cluster into a decoding module;
step two, in the decoding module, only the result of the MASK part is predicted (for "<MASK> <MASK> is a compound word", a conventional decoding module would predict the whole sentence "housetop is a compound word", whereas this decoding module only predicts "housetop", i.e. the MASK part);
step three: repeating the above steps for each piece of fusion information in the fusion information cluster to obtain the prediction results of all mask parts, and then selecting the final model output using information such as the word sequence score and the pronunciation sequence score. The selection output formula is as follows:

Y = argmax over the candidates of P(Y | A, θ) · P(O | X_T, θ)

where P(Y | A, θ) is the word sequence score, P(O | X_T, θ) is the pronunciation sequence score, θ is the model parameter, and A is the phoneme-context fusion information feature.
Further, the following is a further description of the selection of the output formula:
1. For each pronunciation sequence O, a pronunciation sequence score is calculated based on the suspicious region and the pronunciation prediction model. For example, when calculating the score of the pronunciation sequence O = "HH AW S T AH P" with X_T = "how step", the probability of each phoneme is calculated using the knowledge-based phoneme model (the pronunciation prediction model based on expert knowledge mining), and then the probabilities of all the phonemes are added to obtain the pronunciation sequence score P(O = HH AW S T AH P | X_T = how step, θ).
2. For each phoneme-context fusion information A, a word sequence score P(Y | A, θ) is calculated for each candidate sequence using the Transformer module (as shown in fig. 3f), and then the highest score max{P(Y | A, θ)} is chosen as the final word sequence score.
3. For each pronunciation sequence O and the corresponding phoneme-context fusion information a, the word sequence score and the pronunciation sequence score are multiplied as their final scores.
4. The final best output Y is determined using certain rules. For example, the highest score is used as the picking rule.
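A minimal sketch of this selection rule, assuming each candidate carries its word sequence score and pronunciation sequence score:

```python
def select_output(candidates):
    """Each candidate is (Y, P(Y|A,theta), P(O|X_T,theta)); the product of the two
    scores is the final score and the highest-scoring word sequence Y is output."""
    best_y, best_score = None, float("-inf")
    for y, word_score, pron_score in candidates:
        score = word_score * pron_score
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(select_output([("housetop", 0.7, 0.6), ("how stop", 0.5, 0.4)]))   # -> "housetop"
```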
In the Transformer module, for a word sequence Y and its corresponding phoneme-context fusion information A, the i-th word y_i is predicted from the (i−1)-th word y_(i−1) as follows:

1. y_(i−1) is input into a self-attention layer to obtain a self-attention vector;

2. the self-attention vector and the phoneme-context fusion information are input into a mutual attention layer to obtain a mutual attention vector;

3. the mutual attention vector passes through a softmax layer to obtain the i-th word y_i.

In the above partial decoding module of the embodiment of the present application, for each pronunciation sequence O, the prediction is decoded word by word based on the previous word (e.g. y_(i−1)), the phoneme-context fusion information, and a special label (e.g. <M>). For example, given the special label <M> as y_0, the predicted word y_1 may be "housetop"; based on y_1 and the phoneme-context fusion information, the prediction y_2 may be "<\s>", which is an end marker whose detection ends the prediction, yielding an output Y of "<M> housetop <\s>".
The partial decoding module of the embodiment of the application acquires the phoneme-context fusion information from the information fusion module, and then inputs the phoneme-context fusion information into the Transformer module to generate a plurality of candidate texts of the target text unit.
Integrating the above steps, as shown in fig. 3g, after a segment of audio is recognized by an ASR (Automatic Speech Recognition) system, "how step is a compound word" is output, where "how step" is the result of erroneous recognition. First, the suspicious position detection module detects the erroneous position according to the overall recognition result; then, the pronunciation feature extraction module extracts the pronunciation features of the erroneous position (the pronunciation features of "how step"), and the context information extraction module masks the suspicious position and extracts the context mask information; next, the information fusion module fuses the two parts of information to obtain a phoneme-context fusion vector; then, the output of the suspicious position is re-predicted by the partial decoding module ("how step" is re-predicted as "housetop"); finally, the re-predicted output is filled into the suspicious position through mask filling to obtain the final output ("housetop is a compound word").
The voice recognition method provided by the embodiment of the application can be applied to intelligent terminals (such as mobile phones), intelligent household equipment, automobiles, intelligent earphones and the like.
For example, for an application that needs to use a speech recognition function and is provided in the above-mentioned devices such as a smart terminal (e.g. a mobile phone), smart home equipment, an automobile or a smart headset, for example a voice assistant, the scheme of the present application can provide an error correction scheme for the speech recognition function of the voice assistant. Specifically, suppose the exact content of the user's speech should be "where is Enlo Hospital?"; there is a possibility that the erroneous recognition result "where is in to hospital?" is obtained. With the technical scheme of the embodiment of the present application, the erroneous recognition result can be corrected to obtain the correct sentence "where is enlo hospital", which is then input into the semantic understanding module of the voice assistant. In the dictation function of a smart terminal (such as a mobile phone), the technical scheme of the embodiment of the present application can also be used to correct the speech recognition result. For example, when a user says "I like mojito of zhou jilun", the dictation function may misrecognize the speech as "I like choujay demo hit"; after the bilingual mixed recognition according to the technical scheme of the embodiment of the present application, "I like zhou jilun demo hit" is obtained, and then the correct sentence is obtained using the error correction module of the technical scheme of the embodiment of the present application.
To sum up, the speech recognition method provided by the embodiment of the present application may be implemented by a speech recognition system as shown in fig. 4, which mainly includes: a first speech recognition result obtaining module 200, a Suspicious Position Detection (SPD) module 201, a context information extraction module 202, a pronunciation feature information extraction module 203, an information fusion module 204, an information decoding module 205, and a second speech recognition result obtaining module 206. The first speech recognition result obtaining module 200 may be any existing speech recognition system, and is configured to perform initial recognition on the speech to be recognized to obtain an initial text sequence; the initial text sequence is then input into the suspicious position detection module 201, which marks the target text units in the initial text sequence based on the method described above, that is, obtains an input text sequence marked with the target text units. Next, the input text sequence is input into the context information extraction module 202, which outputs a corresponding context text sequence; each target text unit in the input text sequence is input into the pronunciation feature information extraction module 203, which outputs the corresponding pronunciation feature information. The context text sequence and the pronunciation feature information are input together into the information fusion module 204, where the context information and the pronunciation feature information are fused, and the corresponding fusion information is output. The fusion information is input into the information decoding module 205, which decodes it and outputs the predicted text corresponding to each target text unit. The predicted text is input into the second speech recognition result obtaining module 206, which replaces the corresponding target text units and outputs the corrected second speech recognition result.
In the following, the scheme of the present application is further explained by an example. As shown in fig. 5a, in the speech recognition system provided by the present application, a certain first speech recognition result is "i was about to have my first dry score ever", where the confusion values corresponding to the text units are 0.27, 0.16, 1.44, 0.11, 0.52, 0.12, 0.34, 1.66, 1.27 and 0.71, respectively, and the first preset threshold is 1.30. Based on the confusion values and the first preset threshold, the suspicious position detection module (SPD) marks the target text units in the initial text sequence as "about to" and "dry gram", respectively. The target text units in the marked text sequence are then replaced with the preset mask [MASK], giving the input text sequence "i was [MASK] [MASK] have my first [MASK] [MASK] ever"; for the BERT model, an identification tag [CLS] may be added in front, so that the input text sequence becomes "[CLS] i was [MASK] [MASK] have my first [MASK] [MASK] ever", which is then vectorized (Embedding) and converted into the BERT model input "x_0 x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10". After being input into the BERT model, the corresponding context text sequence "y_0 y_1 y_2 y_3 y_4 y_5 y_6 y_7 y_8 y_9 y_10" is output, from which the context information "y_3 y_4" and "y_8 y_9" corresponding to the positions of the target text units is obtained; finally, the preset separator <sep> is used to separate and connect them, giving the context information "y_3 y_4 <sep> y_8 y_9" corresponding to the input text sequence. Meanwhile, the two target text units "about to" and "dry gram" are input into the pronunciation feature information extraction module, and the corresponding phoneme sets are looked up through G2P as "AH B AW T T UW <sep> D R AY G R EY S"; the phoneme sets are embedded to obtain the corresponding pronunciation feature information "p_00 p_01 p_02 p_03 p_04 p_05 <sep> p_10 p_11 p_12 p_13 p_14 p_15 p_16". Then the context information "y_3 y_4 <sep> y_8 y_9" and the pronunciation feature information "p_00 p_01 p_02 p_03 p_04 p_05 <sep> p_10 p_11 p_12 p_13 p_14 p_15 p_16" are input into the information fusion module (position coding information is added to the pronunciation feature information before input), which outputs the fusion information; the fusion information is input into the information decoding module, which outputs the predicted text of the target text unit "about to" as "about to" and the predicted text of "dry gram" as "drag race". In addition, as can be seen from the foregoing description, the whole decoding process includes a plurality of decoding loops until a stop-decoding condition is met; therefore, a start tag <S> and a stop tag <\S> may be added to the fusion information to provide start and stop conditions for the loop decoding process. Finally, the predicted texts are used to replace the corresponding target text units, giving the second speech recognition result "i was about to have my first drag race ever".
When the confusion values corresponding to the text units cannot be obtained, the suspicious position detection module may be replaced by an existing label prediction model, as shown in fig. 5b. For convenience of description, the speech recognition result is denoted as R (i.e. "i was about to have my first dry score ever"), and the specific process is as follows:
1. The speech recognition result is encoded using a Transformer encoding module (TransEnc in the figure) to obtain the encoded information H:
H=TransEnc(R)
2. The encoded information is processed by a linear network (Linear in the figure) to obtain the label location information L:

L = W_l·H + b

where W_l and b are trainable parameters;
3. decoding the location information of the label by using a Conditional Random Field (CRF in the figure) to obtain a suspicious label G corresponding to the text unit:
G = CRF(L, W_crf)
where W_crf is a trainable parameter. For the speech recognition result in this example, the suspicious label sequence G is finally predicted as "O O B I O O O B I O", where "O" indicates that the text unit is not suspicious and "B" and "I" indicate that the text unit is suspicious;
4. Using the suspicious labels, the target text units in the initial text sequence are marked as "about to" and "dry gram", respectively.
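A PyTorch sketch of such a label prediction model; for brevity the CRF decoding step is replaced here by a greedy argmax over the linear-layer emissions, and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SuspiciousTagger(nn.Module):
    """Label prediction alternative to confusion values: a Transformer encoder
    (TransEnc) followed by a linear layer that emits B/I/O tag scores."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2, n_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.linear = nn.Linear(d_model, n_tags)           # L = W_l * H + b

    def forward(self, token_ids):                           # token_ids: (B, T)
        h = self.encoder(self.embed(token_ids))             # H = TransEnc(R)
        emissions = self.linear(h)                           # label location information L
        return emissions.argmax(-1)                          # tags: 0 = "O", 1 = "B", 2 = "I"
```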
Specifically, the information fusion module uses a multi-head mutual attention mechanism network (Co-MHA in the figure), takes the pronunciation feature information P as the request (position coding information is added before input) and the context information C as the key and value, fuses the two parts of information, and adds a feedforward neural network (Feed Forward in the figure) after the multi-head mutual attention mechanism network to further integrate the fusion information. A residual mechanism and a normalization mechanism (Add & Norm in the figure) are added to the output of each network in the information fusion module. In the information decoding module, for each self-loop decoding pass, position coding information is first added to the decoder input, which is then fed into a multi-head self-attention mechanism network (Self-MHA in the figure) to obtain self-attention information. The self-attention information is then used as the request and the fusion information as the key and value, and input into the multi-head mutual attention mechanism network. A feedforward neural network is also added afterwards to further integrate the information, and a residual mechanism and a normalization mechanism are added to the output of each network. Finally, the integrated information is processed by a linear network (Linear in the figure), the softmax network gives the probability of each word in the vocabulary, and the word with the largest probability is selected as the final output, giving the predicted text of each target text unit.
It should be noted that the information fusion module and the information decoding module in fig. 5a only show one pass of information fusion and information decoding. In practical applications, the information fusion module may be formed by stacking a plurality of the information fusion modules shown in fig. 5a in series, that is, the output of the previous information fusion module is used as the input of the next one, and so on, until the last information fusion module finishes processing and outputs the final fusion information; similarly, the information decoding module may also be formed by serially stacking a plurality of the information decoding modules shown in fig. 5a, that is, the output of the previous information decoding module is used as the input of the next one, and so on, until the last information decoding module finishes processing and outputs the final decoding result.
Fig. 6a is a block diagram of a structure of a speech recognition apparatus according to an embodiment of the present application, and as shown in fig. 6a, the apparatus 400a may include: a first voice recognition result obtaining module 401a, an information obtaining module 402a, and a second voice recognition result obtaining module 403a, wherein:
the first speech recognition result obtaining module 401a is configured to obtain a first speech recognition result of a speech to be recognized;
the information obtaining module 402a is configured to obtain context information and pronunciation feature information of the target text unit in the first speech recognition result;
the second speech recognition result obtaining module 403a is configured to obtain a second speech recognition result of the speech to be recognized based on the context information and the pronunciation feature information.
According to the scheme provided by the application, the context information and the pronunciation characteristic information of the target text unit needing to be corrected in the first voice recognition result of the voice to be recognized are obtained, the corresponding target text unit is corrected by combining the context information and the pronunciation characteristic information, the second voice recognition result is obtained after correction, more error types can be covered in the correction process due to the combination of the context information and the pronunciation characteristic information of the target text unit, and the accuracy of the correction result is high.
In an optional embodiment of the present application, the apparatus may further comprise a target text unit determining module configured to:
acquiring a confusion value of a text unit in a first voice recognition result;
based on the confusion value, a target text unit is determined.
In an optional embodiment of the present application, the target text unit determining module is specifically configured to:
determining a first text unit with a confusion value not less than a first preset threshold value as a target text unit; or alternatively,
and acquiring a corresponding second text unit based on the first text unit and at least one text unit before and/or after the first text unit, and determining the second text unit as a target text unit.
In an optional embodiment of the present application, the information obtaining module is specifically configured to:
replacing the target text unit in the first voice recognition result with a preset mask to obtain a corresponding input text sequence;
and acquiring context information corresponding to the target text unit based on the input text sequence.
In an optional embodiment of the present application, the information obtaining module is specifically configured to:
acquiring a first phoneme set of a target text unit;
and acquiring pronunciation characteristic information of the target text unit based on the first phoneme set.
In an optional embodiment of the present application, the information obtaining module is further configured to:
using the phonemes in the first phoneme set as pronunciation characteristic information of the target text unit; or alternatively,
and replacing at least one phoneme in the first phoneme set to obtain a second phoneme set, and taking the phoneme in the second phoneme set as pronunciation characteristic information of the target text unit.
In an optional embodiment of the present application, the information obtaining module is further configured to:
acquiring at least one candidate replacing phoneme of the phonemes in the first phoneme set and the probability of the candidate replacing phoneme as a replacing phoneme of the corresponding phoneme based on a preset rule;
and replacing the corresponding phoneme by at least one candidate replacing phoneme of which the probability is not less than a second preset threshold value to obtain a corresponding second phoneme set.
In an optional embodiment of the present application, the second speech recognition result obtaining module includes an information fusion sub-module, a predictive text obtaining sub-module, and a second speech recognition result obtaining sub-module, where:
the information fusion sub-module is used for acquiring corresponding fusion information based on the context information and the pronunciation characteristic information;
the predicted text acquisition sub-module is used for acquiring the predicted text of the target text unit based on the fusion information;
and the second voice recognition result acquisition submodule is used for replacing the target text unit in the first voice recognition result by using the predicted text to obtain a second voice recognition result.
In an optional embodiment of the present application, the information fusion sub-module is specifically configured to:
and fusing the pronunciation characteristic information and the context information by using the multi-head mutual attention mechanism network and taking the pronunciation characteristic information as a request and the context information as a key value to obtain fused information.
In an optional embodiment of the present application, the predictive text retrieval sub-module is specifically configured to:
and decoding the fusion information to obtain a predicted text of the target text unit.
In an optional embodiment of the present application, the predictive text acquisition sub-module is further configured to:
and for the current target text unit, decoding the fusion information based on the predicted text of the previous target text unit to obtain the predicted text of the current target text unit.
In an optional embodiment of the present application, the first speech recognition result obtaining module is specifically configured to:
acquiring a speech to be recognized comprising at least two languages, and acquiring at least one candidate speech recognition result aiming at each language;
acquiring a word graph corresponding to each language based on at least one candidate voice recognition result;
and searching paths in the word graphs corresponding to the languages, and taking the text sequence corresponding to the optimal path as a first voice recognition result.
In an optional embodiment of the present application, the first speech recognition result obtaining module is further configured to:
acquiring a starting time interval and a terminating time interval of a text unit in at least one candidate voice recognition result, and acquiring a longest public subsequence corresponding to at least one candidate voice recognition result;
and acquiring word graphs corresponding to all languages based on the starting time interval and the ending time interval of the text unit and the longest common subsequence.
In an optional embodiment of the present application, the first speech recognition result obtaining module is further configured to:
acquiring the sequence of text units based on the starting time interval and the ending time interval of the text units in the word graph corresponding to each language;
performing path search based on the sequence to obtain a first path set;
and skipping any text unit in the first path by a preset number of text units and then connecting the text unit with the next text unit to obtain a second path set.
In an optional embodiment of the present application, the first speech recognition result obtaining module is specifically configured to:
acquiring a voice to be recognized comprising at least two languages;
respectively coding each language to obtain corresponding voice characteristics;
and decoding the voice characteristics respectively corresponding to all the languages and the text information characteristics of the voice to be recognized to obtain a first voice recognition result of the voice to be recognized.
In an optional embodiment of the present application, the first speech recognition result obtaining module is further configured to:
acquiring first decoding characteristics corresponding to each language by using a multi-head mutual attention mechanism network and respectively taking text information characteristics as a request and voice characteristics corresponding to each language as key values;
carrying out linear classification on each first decoding characteristic to obtain a weight coefficient of each first decoding characteristic;
and acquiring a first voice recognition result based on each first decoding characteristic and the corresponding weight coefficient.
In an optional embodiment of the present application, the first speech recognition result obtaining module is further configured to:
acquiring second decoding characteristics of the voice to be recognized based on the first decoding characteristics and the corresponding weight coefficients;
obtaining language classification characteristics of the voice to be recognized based on the second decoding characteristics;
and splicing the second decoding features and the language classification features, and acquiring a first voice recognition result based on a splicing result.
Fig. 6b is a block diagram of a speech recognition apparatus according to an embodiment of the present application, and as shown in fig. 6b, the apparatus 400b may include: a candidate speech recognition result obtaining module 401b, a word graph obtaining module 402b, and a path searching module 403b, wherein:
the candidate speech recognition result obtaining module 401b is configured to obtain a speech to be recognized including at least two languages, and obtain at least one candidate speech recognition result for each language;
the word graph acquiring module 402b is configured to acquire a word graph corresponding to each language based on at least one candidate speech recognition result;
the path search module 403b is configured to perform path search in the word graph corresponding to each language, and use a text sequence corresponding to the optimal path as a first speech recognition result.
In an optional embodiment of the present application, the word graph obtaining module is specifically configured to:
acquiring a starting time interval and an ending time interval of a text unit in at least one candidate voice recognition result, and acquiring a longest public subsequence corresponding to at least one candidate voice recognition result;
and acquiring word graphs corresponding to all languages based on the starting time interval and the ending time interval of the text unit and the longest common subsequence.
In an optional embodiment of the present application, the path search module is specifically configured to:
acquiring the sequence of text units based on the starting time interval and the ending time interval of the text units in the word graph corresponding to each language;
performing path search based on the sequence to obtain a first path set;
and skipping any text unit in the first path by a preset number of text units and then connecting the text unit with the next text unit to obtain a second path set.
Fig. 6c is a block diagram of a speech recognition apparatus according to an embodiment of the present application, and as shown in fig. 6c, the apparatus 400c may include: a to-be-recognized speech acquisition module 401c, an encoding module 402c, and a decoding module 403c, wherein:
the to-be-recognized voice acquiring module 401c is configured to acquire to-be-recognized voices including at least two languages;
the encoding module 402c is configured to encode each language to obtain corresponding speech features;
the decoding module 403c is configured to decode based on the speech features respectively corresponding to the languages and the text information features of the speech to be recognized, so as to obtain a first speech recognition result of the speech to be recognized.
In an optional embodiment of the present application, the decoding module is specifically configured to:
acquiring second decoding characteristics of the voice to be recognized based on the first decoding characteristics and the corresponding weight coefficients;
obtaining language classification characteristics of the voice to be recognized based on the second decoding characteristics;
and splicing the second decoding features and the language classification features, and acquiring a first voice recognition result based on a splicing result.
In an optional embodiment of the present application, the decoding module is specifically configured to:
the first voice recognition result acquisition module is used for acquiring a first voice recognition result of the voice to be recognized;
the information acquisition module is used for acquiring context information and pronunciation characteristic information of the target text unit in the first voice recognition result;
and the second voice recognition result acquisition module is used for acquiring a second voice recognition result of the voice to be recognized based on the context information and the pronunciation characteristic information.
Referring now to fig. 7, shown is a schematic diagram of an electronic device (e.g., a terminal device or a server that performs the method shown in fig. 1) 500 suitable for implementing embodiments of the present application. The electronic device in the embodiments of the present application may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), a wearable device, and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
The electronic device includes: a memory for storing a program for executing the method of the above-mentioned method embodiments and a processor; the processor is configured to execute programs stored in the memory. The processor may be referred to as a processing device 501 described below, and the memory may include at least one of a Read Only Memory (ROM)502, a Random Access Memory (RAM)503, and a storage device 508, which are described below:
As shown in fig. 7, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic device 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 7 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. When executed by the processing device 501, the computer program performs the above-described functions defined in the method of the embodiment of the present application.
It should be noted that the computer readable storage medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring a first voice recognition result of a voice to be recognized; acquiring context information and pronunciation characteristic information of a target text unit in a first voice recognition result; acquiring a second voice recognition result of the voice to be recognized based on the context information and the pronunciation characteristic information;
or, acquiring a speech to be recognized comprising at least two languages, and acquiring at least one candidate speech recognition result aiming at each language;
acquiring a word graph corresponding to each language based on at least one candidate voice recognition result;
performing path search in the word graph corresponding to each language, and taking a text sequence corresponding to the optimal path as a first voice recognition result;
or, acquiring a speech to be recognized containing at least two languages; respectively coding each language to obtain corresponding voice characteristics; and decoding the voice characteristics respectively corresponding to all the languages and the text information characteristics of the voice to be recognized to obtain a first voice recognition result of the voice to be recognized.
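For the word-graph variant above, a rough, non-authoritative sketch of choosing the optimal path is shown below; the graph representation, the start/end node sets and the scoring function are assumptions introduced only for illustration:

    # Illustrative best-path selection over a word graph (directed acyclic graph).
    def enumerate_paths(graph, starts, ends):
        # depth-first enumeration of all paths from a start node to an end node
        paths = []

        def walk(node, path):
            path = path + [node]
            if node in ends:
                paths.append(path)
            for nxt in graph.get(node, ()):
                walk(nxt, path)

        for start in starts:
            walk(start, [])
        return paths

    def best_path(graph, starts, ends, score_fn):
        # the text sequence of the highest-scoring path becomes the first recognition result
        return max(enumerate_paths(graph, starts, ends), key=score_fn)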
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules or units described in the embodiments of the present application may be implemented by software or hardware. In some cases, the name of a module or unit does not constitute a limitation of the unit itself; for example, the first constraint obtaining module may also be described as a "module that obtains first constraints".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The apparatus provided in the embodiment of the present application may implement at least one of the modules through an AI model. The functions associated with the AI may be performed by the non-volatile memory, the volatile memory, and the processor.
The processor may include one or more processors. The one or more processors may be general-purpose processors, such as a Central Processing Unit (CPU), an Application Processor (AP), or the like; graphics-only processing units, such as a Graphics Processing Unit (GPU) or a Vision Processing Unit (VPU); and/or AI-dedicated processors, such as a Neural Processing Unit (NPU).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operating rules or artificial intelligence models are provided through training or learning.
Here, providing by learning means that a predefined operation rule or an AI model having a desired characteristic is obtained by applying a learning algorithm to a plurality of learning data. This learning may be performed in the device itself in which the AI according to the embodiment is performed, and/or may be implemented by a separate server/system.
The AI model may include a plurality of neural network layers. Each layer has a plurality of weight values, and the calculation of one layer is performed based on the calculation result of the previous layer and the plurality of weights of the current layer. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and deep Q networks.
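As a toy illustration of the statement that each layer's calculation uses the previous layer's result and the current layer's weights (the shapes, the ReLU non-linearity and the variable names are assumptions, not the AI model of the embodiments):

    # Minimal feed-forward pass: each layer transforms the previous layer's output.
    import numpy as np

    def forward(x, layers):
        # layers: list of (weight_matrix, bias_vector) pairs
        h = x
        for w, b in layers:
            h = np.maximum(0.0, h @ w + b)  # linear transform followed by ReLU
        return h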
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific method implemented by the computer-readable medium described above when executed by the electronic device may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts of the figures may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that, for those skilled in the art, various improvements and refinements can be made without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (28)

1. A speech recognition method, comprising:
acquiring a first voice recognition result of a voice to be recognized;
acquiring context information and pronunciation characteristic information of a target text unit in the first voice recognition result;
and acquiring a second voice recognition result of the voice to be recognized based on the context information and the pronunciation characteristic information.
2. The method of claim 1, further comprising:
acquiring a confusion value of a text unit in the first voice recognition result;
determining the target text unit based on the confusion value.
3. The method of claim 2, wherein determining the target text unit based on the confusion value comprises:
determining a first text unit with a confusion value not less than a first preset threshold value as the target text unit; or,
and acquiring a corresponding second text unit based on the first text unit and at least one text unit before and/or after the first text unit, and determining the second text unit as the target text unit.
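A hedged sketch of the selection described in claims 2 and 3 follows; the threshold value, the neighbourhood size and all identifiers are illustrative assumptions only:

    # Illustrative selection of target text units from per-unit confusion values.
    def select_target_spans(units, confusion_values, threshold=0.5, neighbours=1):
        spans = []
        for i, value in enumerate(confusion_values):
            if value >= threshold:  # first text unit with high confusion
                lo = max(0, i - neighbours)
                hi = min(len(units), i + neighbours + 1)
                spans.append((lo, hi))  # second text unit spanning adjacent units
        return spans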
4. The method of claim 1, wherein obtaining context information of a target text unit in the first speech recognition result comprises:
replacing the target text unit in the first voice recognition result with a preset mask to obtain a corresponding input text sequence;
and acquiring context information corresponding to the target text unit based on the input text sequence.
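An illustrative rendering of the masking step in claim 4 is sketched below; the mask token and function name are assumptions:

    # Replace target text units with a preset mask to build the input text sequence.
    MASK = "[MASK]"

    def build_input_sequence(first_result_units, target_spans):
        sequence = list(first_result_units)
        for lo, hi in target_spans:
            for i in range(lo, hi):
                sequence[i] = MASK  # masked positions supply the context information
        return sequence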
5. The method of claim 1, wherein obtaining pronunciation characteristic information of a target text unit in the first speech recognition result comprises:
acquiring a first phoneme set of the target text unit;
and acquiring pronunciation characteristic information of the target text unit based on the first phoneme set.
6. The method of claim 5, wherein the obtaining pronunciation characteristic information of the target text unit based on the first phone set comprises:
taking phonemes in the first phoneme set as pronunciation characteristic information of the target text unit; or,
and replacing at least one phoneme in the first phoneme set to obtain a second phoneme set, and using the phonemes in the second phoneme set as pronunciation characteristic information of the target text unit.
7. The method of claim 6, wherein said replacing at least one phone in said first phone set to obtain a second phone set comprises:
acquiring at least one candidate replacement phoneme of the phonemes in the first phoneme set and the probability of the candidate replacement phoneme as the replacement phoneme of the corresponding phoneme based on a preset rule;
and replacing the corresponding phoneme by at least one candidate replacing phoneme of which the probability is not less than a second preset threshold value to obtain a corresponding second phoneme set.
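The replacement described in claims 6 and 7 can be pictured with the following sketch; the confusion table, probabilities and threshold are invented for illustration and are not data from the application:

    # Illustrative generation of second phoneme sets by probabilistic replacement.
    CANDIDATE_REPLACEMENTS = {"n": [("l", 0.6), ("m", 0.1)]}  # hypothetical preset rule

    def second_phoneme_sets(first_phoneme_set, threshold=0.5):
        variants = []
        for i, phoneme in enumerate(first_phoneme_set):
            for candidate, probability in CANDIDATE_REPLACEMENTS.get(phoneme, []):
                if probability >= threshold:  # keep sufficiently likely replacements
                    variant = list(first_phoneme_set)
                    variant[i] = candidate
                    variants.append(variant)
        return variants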
8. The method according to claim 1, wherein the obtaining a second speech recognition result of the speech to be recognized based on the context information and the pronunciation feature information comprises:
acquiring corresponding fusion information based on the context information and the pronunciation characteristic information;
acquiring a predicted text of the target text unit based on the fusion information;
and replacing the target text unit in the first voice recognition result by using the predicted text to obtain a second voice recognition result.
9. The method according to claim 8, wherein the obtaining corresponding fusion information based on the context information and the pronunciation feature information comprises:
and fusing the pronunciation characteristic information and the context information by using a multi-head mutual attention mechanism network, taking the pronunciation characteristic information as a query and the context information as a key and a value, to obtain the fusion information.
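A single-head attention sketch of the fusion in claim 9 is given below (the claim uses a multi-head network; the single head, the scaling and the shapes here are simplifying assumptions):

    # Illustrative attention: pronunciation features as query, context features as key/value.
    import numpy as np

    def cross_attention(pronunciation_q, context_kv):
        # pronunciation_q: (Tq, d) query; context_kv: (Tk, d) key and value
        scores = pronunciation_q @ context_kv.T / np.sqrt(pronunciation_q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over context positions
        return weights @ context_kv  # fusion information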
10. The method according to claim 8, wherein the obtaining the predicted text of the target text unit based on the fusion information comprises:
and decoding the fusion information to obtain a predicted text of the target text unit.
11. The method of claim 10, wherein said decoding the fused information to obtain the predicted text of the target text unit comprises:
and for the current target text unit, decoding the fusion information based on the predicted text of the previous target text unit to obtain the predicted text of the current target text unit.
12. The method according to claim 1, wherein the obtaining a first speech recognition result of the speech to be recognized comprises:
acquiring a speech to be recognized comprising at least two languages, and acquiring at least one candidate speech recognition result aiming at each language;
acquiring word graphs corresponding to the languages based on the at least one candidate voice recognition result;
and searching paths in the word graphs corresponding to the languages, and taking the text sequence corresponding to the optimal path as the first voice recognition result.
13. The method according to claim 12, wherein said obtaining a word graph corresponding to each language based on the at least one candidate speech recognition result comprises:
acquiring a starting time interval and a terminating time interval of a text unit in the at least one candidate voice recognition result, and acquiring a longest public subsequence corresponding to the at least one candidate voice recognition result;
and acquiring word graphs corresponding to the languages based on the starting time interval and the ending time interval of the text unit and the longest public subsequence.
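The longest common subsequence referred to in claim 13 can be computed with standard dynamic programming, as in the following sketch applied to sequences of text units (identifiers are assumptions):

    # Illustrative longest common subsequence between two candidate recognition results.
    def longest_common_subsequence(a, b):
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                if a[i] == b[j]:
                    dp[i + 1][j + 1] = dp[i][j] + 1
                else:
                    dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
        # backtrack to recover the shared text units
        result, i, j = [], m, n
        while i and j:
            if a[i - 1] == b[j - 1]:
                result.append(a[i - 1])
                i -= 1
                j -= 1
            elif dp[i - 1][j] >= dp[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return result[::-1]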
14. The method of claim 13, wherein performing a path search in the word graph corresponding to each language comprises:
acquiring the sequence of the text units based on the starting time interval and the ending time interval of the text units in the word graph corresponding to each language;
performing path search based on the sequence to obtain a first path set;
and connecting any text unit in the first path to a subsequent text unit after skipping a preset number of text units, so as to obtain a second path set.
15. The method according to claim 1, wherein the obtaining a first speech recognition result of the speech to be recognized comprises:
acquiring a voice to be recognized comprising at least two languages;
respectively coding each language to obtain corresponding voice characteristics;
and decoding the voice characteristics respectively corresponding to all languages and the text information characteristics of the voice to be recognized to obtain a first voice recognition result of the voice to be recognized.
16. The method according to claim 15, wherein the decoding based on the speech features respectively corresponding to the languages and the text information features of the speech to be recognized to obtain the first speech recognition result of the speech to be recognized comprises:
acquiring first decoding characteristics corresponding to each language by using a multi-head mutual attention mechanism network, respectively taking the text information characteristics as a query and the voice characteristics corresponding to each language as keys and values;
carrying out linear classification on each first decoding characteristic to obtain a weight coefficient of each first decoding characteristic;
and acquiring the first voice recognition result based on each first decoding characteristic and the corresponding weight coefficient.
17. The method of claim 16, wherein obtaining the first speech recognition result based on each first decoding feature and the corresponding weight coefficient comprises:
acquiring second decoding characteristics of the voice to be recognized based on the first decoding characteristics and the corresponding weight coefficients;
obtaining language classification characteristics of the voice to be recognized based on the second decoding characteristics;
and splicing the second decoding features and the language classification features, and acquiring the first voice recognition result based on a splicing result.
18. A speech recognition method, comprising:
acquiring to-be-recognized voice containing at least two languages, and acquiring at least one candidate voice recognition result aiming at each language;
acquiring word graphs corresponding to the languages based on the at least one candidate voice recognition result;
and searching paths in the word graphs corresponding to the languages, and taking a text sequence corresponding to the optimal path as a first voice recognition result.
19. The method according to claim 18, wherein said obtaining a word graph corresponding to each language based on the at least one candidate speech recognition result comprises:
acquiring a starting time interval and a terminating time interval of a text unit in the at least one candidate voice recognition result, and acquiring a longest public subsequence corresponding to the at least one candidate voice recognition result;
and acquiring word graphs corresponding to the languages based on the starting time interval and the ending time interval of the text unit and the longest public subsequence.
20. The method of claim 19, wherein performing a path search in the word graph corresponding to each language comprises:
acquiring the sequence of the text units based on the starting time interval and the ending time interval of the text units in the word graph corresponding to each language;
performing path search based on the sequence to obtain a first path set;
and connecting any text unit in the first path to a subsequent text unit after skipping a preset number of text units, so as to obtain a second path set.
21. A speech recognition method, comprising:
acquiring a voice to be recognized comprising at least two languages;
respectively coding each language to obtain corresponding voice characteristics;
and decoding the voice characteristics respectively corresponding to all languages and the text information characteristics of the voice to be recognized to obtain a first voice recognition result of the voice to be recognized.
22. The method according to claim 21, wherein the decoding is performed based on the speech features respectively corresponding to the languages and the text information features of the speech to be recognized, so as to obtain a first speech recognition result of the speech to be recognized, and the method includes:
acquiring first decoding characteristics corresponding to each language by using a multi-head mutual attention mechanism network, respectively taking the text information characteristics as a query and the voice characteristics corresponding to each language as keys and values;
carrying out linear classification on each first decoding characteristic to obtain a weight coefficient of each first decoding characteristic;
and acquiring the first voice recognition result based on each first decoding characteristic and the corresponding weight coefficient.
23. The method of claim 22, wherein obtaining the first speech recognition result based on each first decoding feature and the corresponding weight coefficient comprises:
acquiring second decoding characteristics of the voice to be recognized based on the first decoding characteristics and the corresponding weight coefficients;
obtaining language classification features of the voice to be recognized based on the second decoding features;
and splicing the second decoding features and the language classification features, and acquiring the first voice recognition result based on a splicing result.
24. A speech recognition apparatus, comprising:
the first voice recognition result acquisition module is used for acquiring a first voice recognition result of the voice to be recognized;
the information acquisition module is used for acquiring the context information and pronunciation characteristic information of the target text unit in the first voice recognition result;
and the second voice recognition result acquisition module is used for acquiring a second voice recognition result of the voice to be recognized based on the context information and the pronunciation characteristic information.
25. A speech recognition apparatus, comprising:
the candidate voice recognition result acquisition module is used for acquiring to-be-recognized voice containing at least two languages and acquiring at least one candidate voice recognition result aiming at each language;
a word graph obtaining module, configured to obtain a word graph corresponding to each language based on the at least one candidate speech recognition result;
and the path searching module is used for searching paths in the word graphs corresponding to the languages and taking the text sequence corresponding to the optimal path as a first voice recognition result.
26. A speech recognition apparatus, comprising:
the system comprises a to-be-recognized voice acquisition module, a recognition processing module and a recognition processing module, wherein the to-be-recognized voice acquisition module is used for acquiring to-be-recognized voice containing at least two languages;
the coding module is used for coding each language respectively to obtain corresponding voice characteristics;
and the decoding module is used for decoding based on the voice characteristics respectively corresponding to all the languages and the text information characteristics of the voice to be recognized to obtain a first voice recognition result of the voice to be recognized.
27. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor for executing the computer program to implement the method of any one of claims 1 to 23.
28. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the method of any one of claims 1 to 23.
CN202110736466.8A 2021-03-25 2021-06-30 Voice recognition method and device, electronic equipment and computer readable storage medium Pending CN115132175A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/KR2021/020378 WO2022203167A1 (en) 2021-03-25 2021-12-31 Speech recognition method, apparatus, electronic device and computer readable storage medium
EP21933394.5A EP4248441A1 (en) 2021-03-25 2021-12-31 Speech recognition method, apparatus, electronic device and computer readable storage medium
US17/592,956 US20220310077A1 (en) 2021-03-25 2022-02-04 Speech recognition method, apparatus, electronic device and computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021103224061 2021-03-25
CN202110322406 2021-03-25

Publications (1)

Publication Number Publication Date
CN115132175A true CN115132175A (en) 2022-09-30

Family

ID=83375731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110736466.8A Pending CN115132175A (en) 2021-03-25 2021-06-30 Voice recognition method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115132175A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024078565A1 (en) * 2022-10-13 2024-04-18 International Business Machines Corporation Domain adaptive speech recognition using artificial intelligence
CN115910035A (en) * 2023-03-01 2023-04-04 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116665675A (en) * 2023-07-25 2023-08-29 上海蜜度信息技术有限公司 Voice transcription method, system, electronic equipment and storage medium
CN116665675B (en) * 2023-07-25 2023-12-12 上海蜜度信息技术有限公司 Voice transcription method, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11900915B2 (en) Multi-dialect and multilingual speech recognition
EP4073787B1 (en) System and method for streaming end-to-end speech recognition with asynchronous decoders
US11776531B2 (en) Encoder-decoder models for sequence to sequence mapping
CN109344391B (en) Multi-feature fusion Chinese news text abstract generation method based on neural network
CN111429889B (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
US20220310077A1 (en) Speech recognition method, apparatus, electronic device and computer readable storage medium
CN115132175A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
US20240028841A1 (en) Speech translation method, device, and storage medium
EP3948850B1 (en) System and method for end-to-end speech recognition with triggered attention
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN106503231B (en) Search method and device based on artificial intelligence
CN113488028B (en) Speech transcription recognition training decoding method and system based on fast jump decoding
KR20220130565A (en) Keyword detection method and apparatus thereof
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
Ghannay et al. Word embeddings combination and neural networks for robustness in asr error detection
CN116312480A (en) Voice recognition method, device, equipment and readable storage medium
Liang et al. Transformer-based end-to-end speech recognition with residual gaussian-based self-attention
JP2023542057A (en) Training a neural network using graph-based temporal classification
Lin et al. Ctc network with statistical language modeling for action sequence recognition in videos
CN116324973A (en) Transducer-based automatic speech recognition system including a time reduction layer
CN113488029A (en) Non-autoregressive speech recognition training decoding method and system based on parameter sharing
CN115132210B (en) Audio recognition method, training method, device and equipment of audio recognition model
CN111553142A (en) Natural language reasoning method and system
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination