CN111383641B - Voice recognition method, device and controller - Google Patents

Voice recognition method, device and controller

Info

Publication number
CN111383641B
CN111383641B CN201811639786.6A
Authority
CN
China
Prior art keywords
syllable
sequence
transition probability
probability distribution
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811639786.6A
Other languages
Chinese (zh)
Other versions
CN111383641A (en)
Inventor
黄佑佳
聂为然
于海
翁富良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201811639786.6A priority Critical patent/CN111383641B/en
Publication of CN111383641A publication Critical patent/CN111383641A/en
Application granted granted Critical
Publication of CN111383641B publication Critical patent/CN111383641B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L2013/105 Duration
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/027 Syllables being the recognition units

Abstract

The method establishes in advance a syllable transition probability distribution for each user, determines, based on that distribution, the syllable transition probabilities of the syllable sequence corresponding to a voice signal input by the current user, determines candidate syllable text for the voice signal based on the determined syllable transition probabilities and a pre-built language model, and outputs the candidate syllable text. On the one hand, the syllable transition probability distribution indicates each user's personalized pronunciation characteristics, which improves recognition accuracy during speech recognition. On the other hand, because the syllable transition probability distribution requires little data, it can be embedded in a wide range of devices, enabling personalized speech recognition on the terminal side (for example, a mobile phone or an in-vehicle head unit).

Description

Voice recognition method, device and controller
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, and controller.
Background
With the rapid development of speech recognition technology, personalized speech recognition for specific application scenarios and dialects has emerged alongside Mandarin recognition, and application demand for such specific scenarios in fields such as intelligent terminals and intelligent vehicles keeps growing.
At present, common personalized speech recognition methods fall into two main types. The first performs speech recognition with a general acoustic model, continuously accumulates data during use, manually labels the personalized data, and, once enough data has accumulated, retrains the general acoustic model on the accumulated personalized data so that it adapts to personalized pronunciation characteristics. However, the manual labeling required to collect the personalized data is error-prone; moreover, retraining the general acoustic model is computationally heavy and must be repeated each time enough new data accumulates, which is difficult to achieve on the terminal side. The second uploads local speech and user-related data to the cloud and uses the user's personalized data to select a matching acoustic model from multiple candidate acoustic models in the cloud. On the one hand, this method must classify the user's personalized data over a large amount of data, and factors such as region and speaking rate make the classification inaccurate. On the other hand, it builds candidate acoustic models for each class of personalized data; since there are many such candidate models, they usually have to be hosted in the cloud and are difficult to realize on the terminal side.
In summary, prior-art personalized speech recognition methods are difficult to implement on the terminal side, and recognition in the cloud is prone to errors.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method, apparatus, and controller, so as to improve recognition accuracy while realizing personalized speech recognition on the terminal side, for example on a smartphone or a smart car.
The embodiment of the application provides the following technical scheme:
a first aspect of an embodiment of the present application provides a speech recognition method, including:
acquiring text data corresponding to a voice signal input by a user and converting the text data into a syllable sequence, the text data being obtained by a speech recognition engine converting the voice signal;
determining syllable transition probabilities corresponding to the sequence of syllables based on the sequence of syllables and a first syllable transition probability distribution T corresponding to the user, the syllable transition probabilities including a transition probability that each syllable in the sequence of syllables corresponds to a true syllable;
and determining candidate syllable texts corresponding to the syllable sequences based on the syllable transfer probability and a pre-constructed language model, and outputting the candidate syllable texts.
According to this scheme, a syllable transition probability distribution is established in advance for each user; the syllable transition probabilities of the syllable sequence corresponding to the voice signal input by the current user are determined based on that distribution, and the candidate syllable text corresponding to the voice signal is then determined and output based on the determined syllable transition probabilities and a pre-established language model. On the one hand, the syllable transition probability distribution indicates each user's personalized pronunciation characteristics, improving recognition accuracy during speech recognition. On the other hand, because the syllable transition probability distribution requires little data, it can be embedded in various devices for use, realizing personalized speech recognition on the terminal side.
In one possible design, the pre-establishing of the first syllable transition probability distribution T corresponding to the user comprises:
acquiring a real syllable sequence U corresponding to a real voice sample of the user and a recognized syllable sequence V obtained by a speech recognition engine recognizing the real voice sample of the user;
determining, based on the real syllable sequence U and the recognized syllable sequence V, the conditional probability of each syllable in the recognized syllable sequence V;
and calculating the product of the conditional probabilities of the syllables in the recognized syllable sequence V to obtain the first syllable transition probability distribution T, where the first syllable transition probability distribution T is p(V|U).
In one possible design, the pre-establishing of the first syllable transition probability distribution T corresponding to the user comprises:
acquiring a real syllable sequence U corresponding to a real voice sample of a user, and a recognized syllable sequence V obtained by recognizing the real voice sample of the user by a voice recognition engine;
the true value syllable U in the real syllable sequence U is divided into i And a recognized syllable V in the recognized syllable sequence V i One-to-one correspondence comparisonCounting any true value syllable u i The first frequency occurring in the real syllable sequence U, and the true value syllable U i The recognized syllable v of the corresponding position i A second frequency of occurrence;
determining the identified syllable v based on the first frequency and the second frequency i Second syllable transition probability p (v) i |u i );
Counting and using all said identified syllables v i Second syllable transition probability p (v) i |u i ) Establishing a first syllable transition probability distribution T corresponding to the user, wherein the first syllable transition probability distribution T is p (V | U).
According to this scheme, a corresponding first syllable transition probability distribution T is pre-established for each user; that is, each user has their own first syllable transition probability distribution T. When the user subsequently inputs a voice signal, speech recognition proceeds based on this unique corresponding first syllable transition probability distribution T, which is more accurate.
In one possible design, the speech recognition method further includes:
determining a second syllable transition probability distribution T_m obtained by training on Mandarin corpus samples, a third syllable transition probability distribution T_g obtained by training on accent-group corpus samples, and a fourth syllable transition probability distribution T_p obtained by training on the user's real corpus samples;
obtaining, within a preset time period, the syllable transition probability T_p(x, y) corresponding to any syllable in the fourth syllable transition probability distribution T_p, where x and y indicate the coordinate position of the syllable transition probability T_p(x, y) in the fourth syllable transition probability distribution T_p;
when the syllable transition probability T_p(x, y) is greater than 0, updating the first syllable transition probability distribution T using the syllable transition probability T_p(x, y);
when the syllable transition probability T_p(x, y) is equal to 0, assigning the syllable transition probability T_g(x, y) to the syllable transition probability T_p(x, y) and updating the first syllable transition probability distribution T, where the syllable transition probability T_g(x, y) is the accent recognition probability corresponding to that syllable.
According to this scheme, training on Mandarin corpus samples, accent-group corpus samples, and the user's real corpus samples yields the syllable transition probability distribution corresponding to each. In the subsequent speech recognition process, these several types of syllable transition probability distributions indicate each user's personalized pronunciation characteristics, thereby improving recognition accuracy during speech recognition.
In one possible design, the speech recognition method further includes:
updating the first syllable transition probability distribution T based on a semi-supervised approach.
In one possible design, after the determining and outputting the candidate syllable text corresponding to the first syllable sequence, the method further includes:
obtaining feedback information, fed back by the user, indicating whether the candidate syllable text is consistent with the voice signal;
if the feedback information indicates that the candidate syllable text is consistent with the voice signal, updating the first syllable transition probability distribution T based on the first syllable sequence corresponding to the candidate syllable text.
According to this scheme, the first syllable transition probability distribution T corresponding to the user can be updated based on user feedback, making it more accurate and providing a more accurate model for subsequent speech recognition, thereby improving speech recognition accuracy.
In one possible design, the pre-building of the language model includes:
acquiring a text sequence corresponding to corpus information, where the corpus information comprises scene corpus alone, or scene corpus combined with one or more of corpus generated according to preset rules and open corpus, the scene corpus being corpus information of the user in a specific scene;
and establishing a corresponding language model based on the corpus information.
In one possible design, the determining and outputting the candidate syllable text corresponding to the syllable sequence based on the syllable transition probability and a pre-established language model includes:
determining a decoding graph of a weighted finite state machine indicating the pre-established language model, the decoding graph being composed of the prior probability p(W) of a text sequence W contained in the pre-established language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a true syllable sequence U;
and determining candidate syllable texts based on the decoding graph and, among the syllable transition probabilities, the transition probability of each syllable corresponding to the real syllable, and outputting the candidate syllable texts.
A second aspect of the embodiments of the present application provides a speech recognition apparatus, including:
a syllable transition module, configured to acquire text data corresponding to a voice signal input by a user and convert the text data into a syllable sequence, the text data being obtained by a speech recognition engine converting the voice signal, and to determine syllable transition probabilities corresponding to the syllable sequence based on the syllable sequence and a first syllable transition probability distribution T corresponding to the user, the syllable transition probabilities including, for each syllable in the syllable sequence, the transition probability that it corresponds to a real syllable;
and the decoding module is used for determining candidate syllable texts corresponding to the syllable sequences based on the syllable transition probability and a pre-constructed language model and outputting the candidate syllable texts.
In one possible design, the speech recognition device further includes:
the system comprises a first probability calculation module, a voice recognition engine, a first syllable transition probability distribution module and a second syllable transition probability distribution module, wherein the first probability calculation module is used for obtaining a real syllable sequence U corresponding to a real voice sample of a user, and a recognition syllable sequence V obtained by recognizing the real voice sample of the user through the voice recognition engine, determining a conditional probability of recognizing each syllable in the recognition syllable sequence V based on the real syllable sequence U and the recognition syllable sequence V, and calculating the product of the conditional probabilities of each syllable in the recognition syllable sequence V to obtain a first syllable transition probability distribution T, and the first syllable transition probability distribution T is p (V | U).
In one possible design, the speech recognition device further includes:
the second probability calculation module is used for acquiring a real syllable sequence U corresponding to a real voice sample of a user and a recognition syllable sequence V obtained by recognizing the real voice sample of the user by a voice recognition engine; dividing true value syllable U in the real syllable sequence U i And a recognized syllable V in the recognized syllable sequence V i Comparing the real values one by one, and counting any real value syllable u i The first frequency occurring in the real syllable sequence U, and the true value syllable U i The recognized syllable v of the corresponding position i A second frequency of occurrence; determining the identified syllable v based on the first frequency and the second frequency i Second syllable transition probability p (v) i |u i ) (ii) a Counting and using all said identified syllables v i Second syllable transition probability p (v) i |u i ) Establishing a first syllable transition probability distribution T corresponding to the user, wherein the first syllable transition probability distribution T is p (V | U).
In one possible design, the speech recognition device further includes:
a first updating module for updating the first syllable transition probability distribution T based on a semi-supervised method.
In one possible design, the speech recognition device further includes:
the second updating module is used for acquiring feedback information whether the candidate syllable text fed back by the user is consistent with the voice signal; if the feedback information is that the candidate syllable text is consistent with the voice signal, updating the first syllable transition probability distribution T based on the first syllable sequence corresponding to the candidate syllable text.
In one possible design, the speech recognition device further includes:
the establishing module is used for obtaining a text sequence corresponding to the corpus information, wherein the corpus information comprises scene corpus, or the scene corpus is combined with one or more of corpus and open corpus generated according to a preset rule, the scene corpus is the corpus information of a user in a specific scene, and a corresponding language model is established based on the corpus information.
In one possible design, the decoding module is configured to determine a decoding graph of a weighted finite state machine indicating the pre-established language model, determine candidate syllable texts based on the decoding graph and the transition probability of each syllable corresponding to a true syllable, and output the candidate syllable texts;
wherein the decoding graph is composed of the prior probability p(W) of a text sequence W contained in the pre-established language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a real syllable sequence U.
A third aspect of the embodiments of the present application discloses a controller, including: a memory, and a processor in communication with the memory;
the memory for storing program code for speech recognition;
the processor is configured to call the program code for speech recognition in the memory, and execute the speech recognition method disclosed in the first aspect of the embodiment of the present application.
A fourth aspect of embodiments of the present application provides a non-transitory computer-readable storage medium for storing a computer program including instructions for performing a method in any one of the possible designs of the first aspect of embodiments of the present application.
Drawings
Fig. 1 is a schematic structural diagram of a speech recognition system disclosed in an embodiment of the present application;
fig. 2 is a schematic flow chart of a speech recognition method disclosed in the embodiment of the present application;
FIG. 3 is a diagram illustrating syllable transition probability distributions according to an embodiment of the present application;
FIG. 4 is a diagram of a weighted finite state machine as disclosed in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech recognition apparatus disclosed in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a controller according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings. In the description of the present application, "/" indicates "or"; for example, A/B may denote A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified. Further, to clearly describe the technical solutions of the embodiments of the present application, words such as "first" and "second" are used to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that words such as "first" and "second" do not limit quantity or order of execution, nor do they necessarily indicate that the items are different.
Furthermore, the terms "comprising" and "having" in the description, claims, and drawings of the embodiments of the present application are not exclusive. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to the listed steps or modules, but may include other steps or modules not listed.
Automatic Speech Recognition (ASR) is a core technology in conversational man-machine dialog systems and the basis of current speech recognition; it can be applied in various intelligent terminals, such as mobile phones and vehicle-mounted terminals.
Fig. 1 is a schematic structural diagram of a speech recognition system disclosed in an embodiment of the present application. The speech recognition system 100 mainly includes: an ASR engine 101 and a speech recognition device 102 as disclosed in embodiments of the present application.
The ASR engine 101 is configured to receive a speech signal of a user, convert the speech signal into text data, and send the text data to the speech recognition device 102.
In a particular implementation, the ASR engine 101 converts the received speech signals into corresponding text data based on speech recognition techniques. The ASR engine may be an existing third party speech recognition engine. The speech recognition device 102 disclosed in the embodiment of the present application further performs speech recognition adapted to various environments and various people groups on the basis of text data output by a third-party speech recognition engine, and achieves the purpose of personalized speech recognition at the terminal side.
It should be noted that the voice signal may also be regarded as a voice instruction, i.e., an instruction given by the user that corresponds to an operation to be performed. For example, when the user wants to turn on the air conditioner, a voice instruction containing the speech "turn on the air conditioner" is given.
The speech recognition apparatus 102 is configured to convert the acquired text data into a syllable sequence based on a text database such as a dictionary, and then realize speech recognition using the syllable sequence, syllable transition probability distribution, and a language model established in advance.
The process of performing speech recognition by the speech recognition apparatus based on the text data sent by the ASR engine is described in detail by the following embodiments disclosed in the present application.
As shown in fig. 2, which is a schematic flow chart of a speech recognition method disclosed in the embodiment of the present application, the speech recognition method includes the following steps:
step S201: acquiring text data corresponding to a voice signal input by a user, and converting the text data into a syllable sequence.
Based on the speech recognition system disclosed in fig. 1, the text data obtained by performing step S201 is obtained by converting the received speech signal by the ASR engine based on an automatic speech recognition technique. Step S201 is executed to convert the acquired text data into a syllable sequence based on a text database such as a dictionary.
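For illustration only, the dictionary-based conversion in step S201 can be sketched as follows; the tiny character-to-syllable table and the function name are hypothetical stand-ins, not part of the patent.

```python
# Minimal sketch of step S201, assuming a dictionary that maps each
# character to its syllables. The table below is a hypothetical toy
# stand-in for a full pronunciation dictionary.
SYLLABLE_DICT = {
    "打": ["d", "a"], "开": ["k", "ai"],
    "空": ["k", "ong"], "调": ["t", "iao"],
}

def text_to_syllables(text: str) -> list[str]:
    """Flatten per-character syllable entries into one syllable sequence."""
    syllables: list[str] = []
    for char in text:
        syllables.extend(SYLLABLE_DICT.get(char, ["<unk>"]))
    return syllables

# "Turn on the air conditioner" -> the sequence used in the examples below.
print(text_to_syllables("打开空调"))  # ['d', 'a', 'k', 'ai', 'k', 'ong', 't', 'iao']
```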
Step S202: based on the syllable sequence and the pre-established first syllable transition probability distribution T of the corresponding user, the syllable transition probability corresponding to the syllable sequence is determined.
In the process of implementing step S202, a first syllable transition probability distribution T corresponding to the user is established in advance for each user; that is, each user has their own first syllable transition probability distribution T.
In a specific implementation, the first syllable transition probability distribution T for each user is obtained as follows: a syllable transition probability distribution is obtained by training and calculation on the syllable sequence corresponding to the user's voice data and the real syllables corresponding to that syllable sequence, the syllable transition probabilities including, for each syllable in the syllable sequence, the transition probability that it corresponds to the real syllable.
Alternatively, the first syllable transition probability distribution T corresponding to the user can be obtained in the following manner.
First, a real syllable sequence U corresponding to a real voice sample of the user is acquired, and the ASR engine recognizes the real voice sample of the user to obtain a recognized syllable sequence V, where the real syllable sequence U = u_1 u_2 … u_k and the recognized syllable sequence V = v_1 v_2 … v_k.
Second, the conditional probability of each syllable in the recognized syllable sequence V is determined based on the real syllable sequence U and the recognized syllable sequence V.
Finally, as shown in formula (1), the product of the conditional probabilities of the syllables in the recognized syllable sequence V is calculated to obtain p(V|U). In the embodiments of the present application, p(V|U) indicates the first syllable transition probability distribution T corresponding to the current user.
p(V|U) = p(v_1|u_1) p(v_2|u_2) … p(v_k|u_k)    (1)
Formula (1) is the conditional probability that, given the real syllable sequence U, the ASR engine recognizes the real voice sample as the recognized syllable sequence V. The following takes the real syllable sequence U = "d a k ai k ong t iao" and the recognized syllable sequence V = "d a k e k ong t iao" as an example. From formula (1):
p(V|U) = p(d|d) p(a|a) p(k|k) p(e|ai) p(k|k) p(t|t) p(iao|iao)    (2)
Regarding the syllable transition probability between individual syllables, formula (2) shows that p(e|ai) represents the probability that the user pronounces the true syllable ai as e, or, equivalently, the probability that the recognized syllable e hides a true pronunciation of ai. When p(ai|ai) is close to 1, the user's pronunciation of the syllable ai is very standard; otherwise the pronunciation is not standard, and the syllable ai will be pronounced as other syllables with a certain probability.
In a specific implementation, a syllable transition probability distribution can be represented as a matrix. Fig. 3 shows the syllable transition probability distribution of syllables such as "c o e … b m s". The closer the elements on the diagonal of the matrix are to 1, the more standard the user's pronunciation. For example, Fig. 3 shows p(s|c) = 0.2, indicating that the standard syllable c is pronounced as s with probability 0.2; p(c|c) = 0.8, indicating that the standard syllable c is pronounced as c with probability 0.8; p(b|b) = 0.9, indicating that the standard syllable b is pronounced as b with probability 0.9; and p(m|b) = 0.1, indicating that the standard syllable b is pronounced as m with probability 0.1.
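As a minimal sketch, the Fig. 3 values can be held as a row-stochastic matrix; the storage layout and names below are assumptions, not the patent's implementation.

```python
import numpy as np

# Hypothetical storage of a syllable transition probability distribution:
# rows are true syllables u, columns are recognized syllables v, and
# entry [u][v] holds p(v|u). Values are the ones read off Fig. 3.
syllables = ["c", "s", "b", "m"]
idx = {s: i for i, s in enumerate(syllables)}

T = np.zeros((len(syllables), len(syllables)))
T[idx["c"], idx["c"]] = 0.8  # p(c|c): standard c pronounced as c
T[idx["c"], idx["s"]] = 0.2  # p(s|c): standard c pronounced as s
T[idx["b"], idx["b"]] = 0.9  # p(b|b)
T[idx["b"], idx["m"]] = 0.1  # p(m|b)

def p(v: str, u: str) -> float:
    """Look up p(v|u) in the matrix."""
    return T[idx[u], idx[v]]

assert abs(T[idx["c"]].sum() - 1.0) < 1e-9  # each filled row is a distribution
print(p("s", "c"))  # 0.2
```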
Optionally, the first syllable transition probability distribution T corresponding to the user may also be obtained by calculation in a statistical manner. The method comprises the following specific steps:
First, a real syllable sequence U corresponding to a real voice sample of the user is acquired, together with a recognized syllable sequence V obtained by a speech recognition engine recognizing the real voice sample of the user, where the real syllable sequence U = u_1 u_2 … u_k and the recognized syllable sequence V = v_1 v_2 … v_k.
Second, each ground-truth syllable u_i in the real syllable sequence U is compared one-to-one with the recognized syllable v_i in the recognized syllable sequence V, counting the first frequency count(u_i) with which any ground-truth syllable u_i occurs in the real syllable sequence U and the second frequency count(v_i) with which the recognized syllable v_i at the corresponding position occurs.
Next, formula (3) is used to determine the second syllable transition probability p(v_i|u_i) of the recognized syllable v_i from the first frequency count(u_i) and the second frequency count(v_i):
p(v_i|u_i) = count(v_i) / count(u_i)    (3)
Finally, the second syllable transition probabilities p(v_i|u_i) of all recognized syllables v_i are collected and used to establish the first syllable transition probability distribution T corresponding to the user.
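The counting estimate of formula (3) might look as follows, assuming position-aligned sequences of equal length; the function and variable names are illustrative.

```python
from collections import Counter

def estimate_transition_probs(true_seq: list[str], recog_seq: list[str]) -> dict:
    """Estimate p(v_i|u_i) = count(v_i) / count(u_i) over aligned pairs,
    per formula (3). Assumes the two sequences are position-aligned."""
    assert len(true_seq) == len(recog_seq)
    u_counts = Counter(true_seq)                     # count(u_i)
    pair_counts = Counter(zip(true_seq, recog_seq))  # count of u_i -> v_i pairs
    return {(u, v): c / u_counts[u] for (u, v), c in pair_counts.items()}

U = "d a k ai k ong t iao".split()
V = "d a k e k ong t iao".split()
probs = estimate_transition_probs(U, V)
print(probs[("ai", "e")])  # 1.0: the single 'ai' was recognized as 'e'
```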
The embodiments of the present application are not limited to these two methods of obtaining the first syllable transition probability distribution T. Optionally, a seq2seq deep neural network model may also be used to train directly on the syllable sequences and obtain the syllable transition probability distribution corresponding to a syllable sequence.
Step S203: and determining candidate syllable texts corresponding to the syllable sequence based on the syllable transfer probability and a pre-constructed language model, and outputting the candidate syllable texts.
In step S203, the language model is pre-constructed from existing corpus information, which may comprise scene corpus alone, or scene corpus combined with one or more of corpus generated according to preset rules and open corpus. Note that the scene corpus is corpus information of the user in a specific scene, for example corpus information obtained in an in-vehicle scenario or in an office scenario. The embodiments of the present application do not limit the application scenario of the speech recognition method; any particular scenario may be used, and the scene corpus corresponds to the scenario in which the speech recognition method is applied.
The specific process of constructing the language model is: acquire the text sequence corresponding to the corpus information, train with an n-gram method on the corpus information, and establish the language model corresponding to the corpus information. The language model can be converted to obtain the graph of a weighted finite state machine corresponding to the language model; this graph is associated with syllable sequences.
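A minimal bigram sketch of the n-gram training described above, using maximum-likelihood counts with no smoothing; the toy in-vehicle corpus and names are hypothetical.

```python
from collections import Counter

def train_bigram_lm(corpus: list[list[str]]) -> dict:
    """Maximum-likelihood bigram model: p(w2|w1) = count(w1 w2) / count(w1).
    A real system would add smoothing and higher-order n-grams."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence + ["</s>"]
        unigrams.update(words[:-1])                   # histories w1
        bigrams.update(zip(words[:-1], words[1:]))    # pairs (w1, w2)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Hypothetical in-vehicle scene corpus.
corpus = [["打开", "空调"], ["关闭", "空调"], ["打开", "车窗"]]
lm = train_bigram_lm(corpus)
print(lm[("打开", "空调")])  # 0.5
```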
The language model is used to score and prune the syllable sequence based on the syllable transition probabilities. This is actually realized on the weighted-finite-state-machine graph obtained by converting the language model, and the syllable text corresponding to the syllable sequence with the highest syllable transition probability is finally determined as the candidate syllable text and output.
Note that the output candidate syllable text is the optimal candidate syllable text.
In a specific implementation, a specific implementation process for determining and outputting a candidate syllable text corresponding to a syllable sequence includes the following steps:
First, a decoding graph of a weighted finite state machine indicating the pre-established language model is determined.
The decoding graph is a graph of a weighted finite state machine. It is composed of the prior probability p(W) of a text sequence W contained in the pre-established language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a true syllable sequence U.
The text sequence W is the text sequence actually expressed by the user, W = w_1 w_2 … w_m.
Then, the optimal candidate syllable text sequence is determined and output based on the decoding graph and, among the syllable transition probabilities, the transition probability of each syllable corresponding to the true syllable.
As is apparent from the above description, the decoding graph is obtained by converting the language model, and the decoding graph can constitute a decoder. The corpus information used to form the language model may include scene corpus, corpus generated according to preset rules, and open corpus. Optionally, the prior probability p(W) and the conditional probability p(U|W) of the text sequence W corresponding to the real syllable sequence U are connected by embedding, so as to construct the decoding graph.
Specifically, the decoding graph is composed of multiple paths, each path containing multiple nodes with the structure syllable:text/loss or syllable:<eps>/loss, where syllable is the currently input syllable, text is the output text corresponding to that syllable, <eps> means there is no output text, and loss is the loss incurred by passing along the path.
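One possible in-memory rendering of that arc structure, purely illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Arc:
    """One transition of the decoding graph, mirroring the node structure
    described above: the input syllable, the output text it emits (None
    standing in for <eps>), and the loss of taking this arc. This is an
    illustrative data layout, not the patent's implementation."""
    syllable: str        # currently input syllable
    text: Optional[str]  # output text, or None for <eps>
    loss: float          # loss incurred along this path segment

# Hypothetical competing arcs for input syllable "n": one of the
# "syllable:text/loss" form and one of the "syllable:<eps>/loss" form.
arcs = [Arc("n", "li", 0.51), Arc("n", None, 0.92)]
```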
When determining the optimal candidate syllable text sequence based on the decoding graph and the syllable transition probabilities, the optimal candidate syllable text corresponding to the syllable sequence is searched for on the decoding graph: starting from the first node, for each syllable in the syllable sequence, the path is selected through the nodes whose transition probability of corresponding to the real syllable is larger and whose accumulated path loss is smallest; the text sequence formed by the text of each node on that path is determined as the optimal candidate syllable text and output.
In a specific implementation, the probability of obtaining the text sequence actually expressed by the user based on the syllable sequence obtained from the ASR engine can be expressed by equation (4).
p(W|V) = p(V|U) * p(U|W) * p(W) / p(V)    (4)
The candidate syllable text sequence W* obtained by executing the above steps S201 to S203 can be expressed by formula (5):
W* = argmax_W p(V|U) * p(U|W) * p(W)    (5)
This is illustrated based on the decoding diagram shown in fig. 4.
Assume the voice signal is "call Li", whose corresponding syllable sequence is "h u j iao l i", and that in the first syllable transition probability matrix the probability of l being mispronounced as n is 0.6, the probability of l being pronounced as l is 0.4, and the remaining syllables have no corresponding transition probabilities.
When a user with this pronunciation characteristic wants to say "call Li" but, owing to the pronunciation problem, is recognized by the ASR engine as "call Ni", the recognition result corresponds to the syllable sequence "h u j iao n i". Based on the decoding graph disclosed in Fig. 4, at the start node 0, since h cannot be pronounced as another syllable, the search moves from node 0 to node 2 and then proceeds along nodes 9, 10, 11, and 12. At node 12, since p(n|l) = 0.6 > p(n|n) = 0.4, although the input at this point is n, the probability that the true pronunciation is l is higher; and because the losses of the two paths between node 12 and node 13 are the same (the prior probability p(W) and the conditional probability p(U|W) making up the decoding graph are identical on this portion), the search preferentially moves to node 13 and finally reaches the end node 18, yielding the decoding result "call Li".
By scoring and pruning the syllable sequence with the decoding graph combined with the syllable transition probabilities in this way, the optimal candidate syllable text is finally determined, which improves recognition accuracy during speech recognition.
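The choice made at node 12 can be reduced to the following toy rescoring of formula (5); the lexicon, prior, and probability tables are illustrative stand-ins for what the decoding graph and language model encode.

```python
import math

# Toy rescoring of formula (5): among candidate text sequences W, pick
# the one maximizing p(V|U) * p(U|W) * p(W). All tables are hypothetical.
trans_p = {("l", "n"): 0.6, ("l", "l"): 0.4, ("n", "n"): 0.4,
           ("h", "h"): 1.0, ("u", "u"): 1.0, ("j", "j"): 1.0,
           ("iao", "iao"): 1.0, ("i", "i"): 1.0}          # p(v|u)
lexicon = {"call Li": ["h", "u", "j", "iao", "l", "i"],    # U given W
           "call Ni": ["h", "u", "j", "iao", "n", "i"]}
lm_prior = {"call Li": 0.5, "call Ni": 0.5}                # p(W)

def log_score(W: str, V: list[str]) -> float:
    """log[ p(V|U) * p(U|W) * p(W) ], with p(U|W) = 1 since the lexicon
    maps each candidate W to a single syllable sequence U here."""
    U = lexicon[W]
    s = math.log(lm_prior[W])
    for u, v in zip(U, V):
        s += math.log(trans_p.get((u, v), 1e-9))
    return s

V = ["h", "u", "j", "iao", "n", "i"]   # recognized syllable sequence
best = max(lexicon, key=lambda W: log_score(W, V))
print(best)  # "call Li": p(n|l)=0.6 outweighs p(n|n)=0.4
```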
It should be noted that, in the process of executing step S202, the first syllable transition probability distribution involved can be determined from syllable transition probability distributions obtained by training on different corpora. The training speech data include Mandarin corpus samples, accent-group corpus samples, and the user's historical corpus samples.
Training on the Mandarin corpus samples, the accent-group corpus samples, and the user's historical corpus samples yields the syllable transition probability distribution corresponding to each. As shown in formula (6), the relationship between the first syllable transition probability distribution T disclosed in the embodiments of the present application and the syllable transition probability distributions obtained by training on these different corpora is:
T = w_1 T_m + w_2 T_g + w_3 T_p    (6)
where T_m is the second syllable transition probability distribution obtained by training on the Mandarin corpus samples, T_g is the third syllable transition probability distribution obtained by training on the accent-group speech samples, and T_p is the fourth syllable transition probability distribution obtained by training on the user's real corpus samples; w_1 is the weight corresponding to the second syllable transition probability distribution T_m, w_2 is the weight corresponding to the third syllable transition probability distribution T_g, and w_3 is the weight corresponding to the fourth syllable transition probability distribution T_p, with w_1 + w_2 + w_3 = 1.
Note that the weights corresponding to the second syllable transition probability distribution T_m and the third syllable transition probability distribution T_g may be determined from the accent type and confidence given by an accent recognition classifier; w_2 reflects the degree of the user's accent, and w_3 embodies the user's personalized pronunciation characteristics.
The weight w_1 corresponding to the second syllable transition probability distribution T_m and the weight w_2 corresponding to the third syllable transition probability distribution T_g gradually decrease over time, whereas the weight w_3 corresponding to the fourth syllable transition probability distribution T_p grows with time: the longer the system is used, the higher its value.
The update principle of the weights w_1, w_2, and w_3 is shown in formula (7), which appears only as an image in the original publication; it expresses the weights in terms of the decay function f(t) and the accent recognition classification model G(wav).
Here f(t) is a function that decays with time t; optionally, it may take the form 1/t (t > 1) or the like. G(wav) denotes an accent recognition classification model indicating which accent a given speech sample is recognized as belonging to and with what probability: G(wav)_m represents the probability that the speech data belong to Mandarin, and G(wav)_g represents the probability that the speech data belong to a certain accent group.
Regarding the update of the first syllable transition probability distribution, the recency of the personalized data is relatively important. In a specific implementation, because the collection and statistics of personalized data are affected by time, during the update of the first syllable transition probability distribution T only the fourth syllable transition probability distribution T_p obtained by training on the user's real corpus samples is updated, specifically at the positions where T_p(x, y) > 0. The specific update process is as follows:
First, within a preset time period, the syllable transition probability T_p(x, y) corresponding to any syllable in the fourth syllable transition probability distribution T_p is obtained, where x and y indicate the coordinate position of T_p(x, y) in the fourth syllable transition probability distribution T_p.
Second, when the syllable transition probability T_p(x, y) is greater than 0, the first syllable transition probability distribution T is updated using the syllable transition probability T_p(x, y).
The specific updating process is shown in formula (8):
T(x, y) = w_1 T_m(x, y) + w_2 T_g(x, y) + w_3 T_p(x, y),  T_p(x, y) > 0    (8)
Otherwise, when the syllable transition probability T_p(x, y) is equal to 0, the syllable transition probability T_g(x, y) is assigned to the syllable transition probability T_p(x, y) to update the first syllable transition probability distribution T.
The specific updating process is shown in formula (9):
T(x, y) = w_1 T_m(x, y) + (w_2 + w_3) T_g(x, y),  T_p(x, y) = 0    (9)
where the syllable transition probability T_g(x, y) is the accent recognition probability corresponding to that syllable.
That is, when T_p(x, y) is equal to 0, the update does not use T_p but uses T_g instead. After the calculation and update are completed, the first syllable transition probability distribution T is renormalized, which completes the update of the first syllable transition probability distribution T.
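A sketch of the update in formulas (8) and (9), treating the distributions as row-stochastic matrices and renormalizing at the end; the array shapes, weights, and names are assumptions.

```python
import numpy as np

def update_T(Tm, Tg, Tp, w1, w2, w3):
    """Update the first syllable transition probability distribution T per
    formulas (8) and (9): where the user-trained Tp has evidence (Tp > 0)
    use it; elsewhere fall back on the accent-group distribution Tg."""
    has_user_data = Tp > 0
    T = np.where(has_user_data,
                 w1 * Tm + w2 * Tg + w3 * Tp,        # formula (8)
                 w1 * Tm + (w2 + w3) * Tg)           # formula (9)
    return T / T.sum(axis=1, keepdims=True)          # renormalize each row

rng = np.random.default_rng(0)
def rand_dist(n):  # random row-stochastic matrix for the demo
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

Tm, Tg = rand_dist(4), rand_dist(4)
Tp = rand_dist(4); Tp[2, :] = 0.0                    # no user data for row 2
T = update_T(Tm, Tg, Tp, w1=0.2, w2=0.3, w3=0.5)
assert np.allclose(T.sum(axis=1), 1.0)
```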
It should be noted that the first syllable transition probability distribution T may optionally also be updated based on a semi-supervised method.
Optionally, after the candidate syllable text corresponding to the first syllable sequence is determined and output, feedback information indicating whether the candidate syllable text is consistent with the voice signal is obtained from the user; if the feedback information indicates that they are consistent, the first syllable transition probability distribution T is updated based on the first syllable sequence corresponding to the candidate syllable text.
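One way such a feedback-driven update could be realized, assuming count-based storage behind T; this is entirely illustrative, not the patent's implementation.

```python
from collections import Counter

class FeedbackUpdater:
    """Accumulate confirmed (true syllable, recognized syllable) pairs and
    re-derive p(v|u) from the counts, a hypothetical realization of the
    feedback update described above."""
    def __init__(self):
        self.pair_counts = Counter()
        self.u_counts = Counter()

    def confirm(self, true_seq, recog_seq):
        # Called only when the user confirms the candidate syllable text
        # is consistent with the voice signal.
        for u, v in zip(true_seq, recog_seq):
            self.pair_counts[(u, v)] += 1
            self.u_counts[u] += 1

    def prob(self, v, u):
        return self.pair_counts[(u, v)] / self.u_counts[u] if self.u_counts[u] else 0.0

upd = FeedbackUpdater()
upd.confirm(["h", "u", "j", "iao", "l", "i"], ["h", "u", "j", "iao", "n", "i"])
print(upd.prob("n", "l"))  # 1.0 after a single confirmed utterance
```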
With the speech recognition method provided by the embodiments of the present application, the syllable transition probabilities of the syllable sequence corresponding to the voice signal input by the current user are determined based on the pre-established syllable transition probability distribution of the corresponding user, and the candidate syllable text corresponding to the voice signal is then determined and output using the determined syllable transition probabilities and the pre-established language model. On the one hand, the syllable transition probability distribution indicates each user's personalized pronunciation characteristics, improving recognition accuracy during speech recognition. On the other hand, because the syllable transition probability distribution requires little data, it can be embedded in various devices for use, realizing personalized speech recognition on the terminal side.
Based on the voice recognition method disclosed by the embodiment of the application, the embodiment of the application also correspondingly discloses a voice recognition device for realizing the voice recognition method.
Fig. 5 is a schematic structural diagram of a speech recognition apparatus 500 disclosed in an embodiment of the present application. The speech recognition apparatus 500 includes: a syllable transfer module 501 and a decoding module 502.
A syllable transition module 501, configured to acquire text data corresponding to a voice signal input by a user and convert the text data into a syllable sequence, the text data being obtained by a speech recognition engine converting the voice signal, and to determine syllable transition probabilities corresponding to the syllable sequence based on the syllable sequence and a pre-established first syllable transition probability distribution T corresponding to the user, the syllable transition probabilities including, for each syllable in the syllable sequence, the transition probability that it corresponds to a real syllable.
And the decoding module 502 is used for determining candidate syllable texts corresponding to the syllable sequences based on the syllable transition probability and the pre-constructed language model and outputting the candidate syllable texts.
The decoding module 502 is specifically configured to determine a decoding graph of a weighted finite state machine indicating the pre-established language model, determine candidate syllable texts based on the decoding graph and the transition probabilities of each syllable corresponding to the true syllables in the syllable transition probabilities, and output the candidate syllable texts.
Wherein the decoding graph consists of the prior probability p(W) of a text sequence W contained in the pre-established language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a real syllable sequence U.
Optionally, the speech recognition apparatus 500 further includes a first probability calculation module.
The first probability calculation module is configured to acquire a real syllable sequence U corresponding to a real voice sample of the user and a recognized syllable sequence V obtained by a speech recognition engine recognizing the real voice sample of the user, determine the conditional probability of each syllable in the recognized syllable sequence V based on the real syllable sequence U and the recognized syllable sequence V, and calculate the product of the conditional probabilities of the syllables in the recognized syllable sequence V to obtain the first syllable transition probability distribution T, where the first syllable transition probability distribution T is p(V|U).
Optionally, the speech recognition apparatus 500 further includes a second probability calculation module.
The second probability calculation module is configured to acquire a real syllable sequence U corresponding to a real voice sample of the user and a recognized syllable sequence V obtained by a speech recognition engine recognizing the real voice sample of the user; compare each ground-truth syllable u_i in the real syllable sequence U one-to-one with the recognized syllable v_i in the recognized syllable sequence V, counting the first frequency with which any ground-truth syllable u_i occurs in the real syllable sequence U and the second frequency with which the recognized syllable v_i at the corresponding position occurs; determine the second syllable transition probability p(v_i|u_i) of the recognized syllable v_i based on the first frequency and the second frequency; and collect the second syllable transition probabilities p(v_i|u_i) of all recognized syllables v_i to establish the first syllable transition probability distribution T corresponding to the user, where the first syllable transition probability distribution T is p(V|U).
Optionally, the speech recognition apparatus 500 further includes a first updating module.
And the first updating module is used for updating the first syllable transfer probability distribution T based on a semi-supervised method.
Optionally, the speech recognition apparatus 500 further includes a second updating module.
The second updating module is configured to acquire feedback information, fed back by the user, indicating whether the candidate syllable text is consistent with the voice signal; if the feedback information indicates that the candidate syllable text is consistent with the voice signal, the first syllable transition probability distribution T is updated based on the first syllable sequence corresponding to the candidate syllable text.
Optionally, the speech recognition apparatus 500 further includes a setup module.
And the establishing module is used for acquiring the text sequence corresponding to the corpus information and establishing a corresponding language model based on the text sequence corresponding to the corpus information. The corpus information includes scene corpus, or the scene corpus is combined with one or more of corpus and open corpus generated according to preset rules. The scene corpus is the corpus information of the user in a specific scene.
The execution principle and process of each module in the speech recognition apparatus disclosed in the embodiment of the present application are the same as the execution principle and process corresponding to the corresponding step in the speech recognition method disclosed in the embodiment of the present application, and reference may be made to the corresponding part in the speech recognition method disclosed in the embodiment of the present application, which is not described herein again.
The speech recognition method disclosed in Fig. 2 and the speech recognition apparatus disclosed in Fig. 5 of the embodiments of the present application, i.e., the speech recognition device in Fig. 1, can also be implemented directly by a controller: in hardware, as a memory whose program is executed by a processor, or as a combination of the two. Fig. 6 is a schematic structural diagram of a controller for implementing the speech recognition method disclosed in the embodiments of the present application; the controller may also be arranged in a main control system or processing system on the terminal side. The controller 600 includes: a memory 601, and a processor 602 and a communication interface 603 in communication with the memory.
The processor 602 is coupled to the memory 601 through a bus. The processor 602 is coupled to the communication interface 603 via a bus.
The processor 602 may be a Central Processing Unit (CPU), a Network Processor (NP), an application-specific integrated circuit (ASIC), or a Programmable Logic Device (PLD). The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), or a General Array Logic (GAL).
The memory 601 may specifically be a content-addressable memory (CAM) or a random-access memory (RAM). The CAM may be a Ternary Content Addressable Memory (TCAM).
The communication interface 603 may be a wired interface, such as a Fiber Distributed Data Interface (FDDI) or ethernet interface.
The memory 601 may also be integrated in the processor 602. If the memory 601 and the processor 602 are separate devices, they are connected and can communicate, for example, via a bus. The communication interface 603 and the processor 602 may likewise communicate via a bus, or the communication interface 603 may be connected directly to the processor 602.
A memory 601 for storing program code for speech recognition. Optionally, the memory 601 includes an operating system and an application program, which are used to carry the operating program, codes or instructions of the speech recognition method disclosed in the embodiment of the present application.
When the processor 602 or a hardware device is to perform the operations related to the speech recognition method disclosed in the embodiments of the present application, the processor calls and executes the operating programs, code, or instructions stored in the memory 601 to complete the processes of the speech recognition method described in the embodiments of the present application. Specifically, the processor 602 calls the program code for speech recognition in the memory 601 and executes the speech recognition method.
Processor 602 implements the steps in the method embodiments by calling a program in memory 601. The processor 602 may also be a physical device that embodies the speech recognition apparatus 500.
It is to be understood that the operations of receiving/transmitting and the like involved in the above-mentioned voice recognition apparatus and voice recognition method embodiment shown in fig. 5 may be receiving/transmitting processing implemented by a processor, or may be transmitting/receiving processes implemented by a receiver and a transmitter, and the receiver and the transmitter may exist independently or may be integrated into a transceiver.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), among others.
In summary, with the voice recognition method and apparatus provided in the embodiments of the present application, the syllable transition probabilities of the syllable sequence corresponding to the voice signal input by the current user are determined based on the pre-established syllable transition probability distribution of that user, and the candidate syllable text corresponding to the voice signal is then determined and output using the determined syllable transition probabilities and a pre-constructed language model. On the one hand, the syllable transition probability distribution captures the personalized pronunciation characteristics of each user, which improves recognition accuracy during speech recognition. On the other hand, because the syllable transition probability distribution requires only a small amount of data, it can be embedded in a variety of devices, enabling personalized speech recognition on the terminal side.
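By way of illustration only, the following Python sketch shows one way the flow just summarized could be wired together on a terminal device. The pypinyin converter, the 1e-12 probability floor, and all names (recognize, transition, lm_prior) are assumptions introduced here, not part of the disclosed implementation:

```python
import math
from pypinyin import lazy_pinyin  # open-source Chinese text-to-pinyin converter

def recognize(engine_text, candidate_texts, transition, lm_prior):
    """Rescore candidate texts with the user's syllable transition
    distribution T (dict {(u, v): p(v|u)}) and a language-model prior
    p(W) (dict {text: probability}); return the best candidate."""
    recognized = lazy_pinyin(engine_text)            # recognized syllables V
    best_text, best_score = None, float("-inf")
    for text in candidate_texts:
        true_syllables = lazy_pinyin(text)           # hypothesized syllables U
        score = math.log(lm_prior.get(text, 1e-12))  # log p(W)
        for u, v in zip(true_syllables, recognized):
            score += math.log(transition.get((u, v), 1e-12))  # log p(v|u)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```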
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application and the benefits derived therefrom have been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions and of the present claims.

Claims (17)

1. A speech recognition method, comprising:
acquiring text data corresponding to a voice signal input by a user and converting the text data into a syllable sequence, the text data being obtained by a voice recognition engine converting the voice signal;
determining syllable transition probabilities corresponding to the syllable sequence based on the syllable sequence and a first syllable transition probability distribution T corresponding to the user, the syllable transition probabilities including a transition probability that each syllable in the syllable sequence corresponds to a real syllable;
and determining a candidate syllable text corresponding to the syllable sequence based on the syllable transition probabilities and a pre-constructed language model, and outputting the candidate syllable text.
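As a non-normative aside, the text-to-syllable conversion recited in claim 1 could be performed with, for example, the open-source pypinyin package; the claim itself does not prescribe any particular converter:

```python
from pypinyin import lazy_pinyin

engine_text = "打开空调"                      # text data from the recognition engine
syllable_sequence = lazy_pinyin(engine_text)  # convert text to a syllable sequence
print(syllable_sequence)                      # ['da', 'kai', 'kong', 'tiao']
```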
2. The method of claim 1, wherein the procedure for pre-establishing the first syllable transition probability distribution T corresponding to the user comprises:
acquiring a real syllable sequence U corresponding to a real voice sample of the user, and a recognized syllable sequence V obtained by a voice recognition engine recognizing the real voice sample;
determining a conditional probability for each syllable in the recognized syllable sequence V based on the real syllable sequence U and the recognized syllable sequence V;
and calculating the product of the conditional probabilities of the syllables in the recognized syllable sequence V to obtain the first syllable transition probability distribution T, where the first syllable transition probability distribution T is p(V|U).
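A minimal sketch of the claim-2 computation, assuming the per-syllable conditional probabilities are already available as a dictionary (the names p_V_given_U and cond_prob are hypothetical):

```python
def p_V_given_U(true_seq, recognized_seq, cond_prob):
    """Product of the per-syllable conditional probabilities, yielding the
    first syllable transition probability distribution value p(V|U).
    cond_prob maps (u_i, v_i) -> p(v_i|u_i); unseen pairs default to 0."""
    prob = 1.0
    for u, v in zip(true_seq, recognized_seq):
        prob *= cond_prob.get((u, v), 0.0)
    return prob
```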
3. The method of claim 1, wherein the procedure for pre-establishing the first syllable transition probability distribution T corresponding to the user comprises:
acquiring a real syllable sequence U corresponding to a real voice sample of the user, and a recognized syllable sequence V obtained by a voice recognition engine recognizing the real voice sample;
comparing each ground-truth syllable u_i in the real syllable sequence U with the recognized syllable v_i at the corresponding position in the recognized syllable sequence V one by one, and counting a first frequency with which any ground-truth syllable u_i occurs in the real syllable sequence U and a second frequency with which the recognized syllable v_i occurs at the position corresponding to the ground-truth syllable u_i;
determining a second syllable transition probability p(v_i|u_i) of the recognized syllable v_i based on the first frequency and the second frequency;
and collecting the second syllable transition probabilities p(v_i|u_i) of all the recognized syllables v_i and using them to establish the first syllable transition probability distribution T corresponding to the user, where the first syllable transition probability distribution T is p(V|U).
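The frequency-count estimation of claim 3 admits a short sketch; the function name and the unsmoothed relative-frequency form are assumptions:

```python
from collections import Counter

def estimate_T(true_seq, recognized_seq):
    """Relative-frequency estimate of p(v_i|u_i): the first frequency counts
    occurrences of each ground-truth syllable u_i, the second counts the
    recognized syllable v_i at u_i's positions."""
    first = Counter(true_seq)                        # first frequency per u_i
    second = Counter(zip(true_seq, recognized_seq))  # second frequency per (u_i, v_i)
    return {(u, v): n / first[u] for (u, v), n in second.items()}

# e.g. a speaker who fronts retroflex initials:
# estimate_T(['zhi', 'shi', 'zhi'], ['zi', 'si', 'zhi'])
# -> {('zhi', 'zi'): 0.5, ('shi', 'si'): 1.0, ('zhi', 'zhi'): 0.5}
```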
4. The method according to any one of claims 1-3, further comprising:
determining a second syllable transition probability distribution T_m obtained by training on Mandarin corpus samples, a third syllable transition probability distribution T_g obtained by training on accent-group corpus samples, and a fourth syllable transition probability distribution T_p obtained by training on real corpus samples of the user;
obtaining, within a preset time period, the syllable transition probability T_p(x, y) corresponding to any syllable in the fourth syllable transition probability distribution T_p, where x and y indicate the coordinate position of the syllable transition probability T_p(x, y) in the fourth syllable transition probability distribution T_p;
when the syllable transition probability T_p(x, y) is greater than 0, updating the first syllable transition probability distribution T with the syllable transition probability T_p(x, y);
and when the syllable transition probability T_p(x, y) is less than 0, assigning the value of the syllable transition probability T_g(x, y) to the syllable transition probability T_p(x, y) and updating the first syllable transition probability distribution T accordingly, the syllable transition probability T_g(x, y) being the accent recognition probability corresponding to the syllable.
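Claim 4's update rule can be sketched over matrix-valued distributions. The negative-sentinel initialization for unobserved entries of T_p is an assumption (the claim only distinguishes T_p(x, y) > 0 from T_p(x, y) < 0), as are all names below:

```python
import numpy as np

def update_T(T, T_g, T_p):
    """Adopt the user's own statistics where observed (T_p > 0) and fall
    back to the accent-group distribution T_g where T_p still carries a
    negative sentinel; all arguments are arrays indexed by (x, y)."""
    T = T.copy()
    observed = T_p > 0
    T[observed] = T_p[observed]   # trust the user's personal statistics
    missing = T_p < 0
    T[missing] = T_g[missing]     # fall back to the accent group
    return T
```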
5. The method according to any one of claims 1-3, further comprising:
updating the first syllable transition probability distribution T based on a semi-supervised method.
6. The method according to any one of claims 1-3, wherein, after the determining and outputting of the candidate syllable text corresponding to the syllable sequence, the method further comprises:
obtaining feedback information from the user indicating whether the candidate syllable text is consistent with the voice signal;
and if the feedback information indicates that the candidate syllable text is consistent with the voice signal, updating the first syllable transition probability distribution T based on the syllable sequence corresponding to the candidate syllable text.
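A sketch of the claim-6 feedback update, assuming the claim-3 frequency counters are kept between utterances; the function name and both Counter arguments are hypothetical:

```python
def apply_feedback(first, second, confirmed_syllables, recognized_syllables,
                   consistent):
    """Fold the confirmed candidate's syllable sequence back into the
    first/second frequency counters (collections.Counter) used to build T,
    but only when the user confirms the candidate text is consistent."""
    if consistent:
        first.update(confirmed_syllables)
        second.update(zip(confirmed_syllables, recognized_syllables))
```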
7. The method according to any one of claims 1-3, wherein the pre-construction of the language model comprises:
acquiring a text sequence corresponding to corpus information, wherein the corpus information comprises a scene corpus, or the scene corpus combined with one or more of a corpus generated according to a preset rule and an open corpus, the scene corpus being corpus information of the user in a specific scene;
and establishing the corresponding language model based on the corpus information.
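Claim 7 leaves the language-model family open; a toy bigram model over the corpus text sequences illustrates the construction (all names here are hypothetical):

```python
from collections import Counter

def build_bigram_lm(text_sequences):
    """Estimate p(token_2 | token_1) by relative frequency over the corpus;
    each sequence is padded with sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for seq in text_sequences:
        tokens = ["<s>"] + list(seq) + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that starts a bigram
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return {bg: n / unigrams[bg[0]] for bg, n in bigrams.items()}
```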
8. The method according to any one of claims 1-3, wherein the determining and outputting of the candidate syllable text corresponding to the syllable sequence based on the syllable transition probabilities and the pre-constructed language model comprises:
determining a decoding graph of a weighted finite-state machine indicating the pre-constructed language model, the decoding graph being composed of the prior probability p(W) of a text sequence W contained in the pre-constructed language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a real syllable sequence U;
and determining the candidate syllable text based on the decoding graph and the transition probability of each syllable corresponding to a real syllable among the syllable transition probabilities, and outputting the candidate syllable text.
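The decoding of claim 8 composes three scores: the prior p(W), the conditional p(U|W), and the per-syllable transition probabilities p(v|u). The sketch below flattens the weighted finite-state decoding graph into an explicit list of hypotheses, which a real implementation would replace with transducer composition and search; the list layout is an assumption:

```python
import math

def decode(recognized, graph, transition):
    """graph: list of (text W, true syllables U, p(W), p(U|W)) tuples,
    a hypothetical stand-in for the weighted finite-state decoding graph."""
    def score(entry):
        text, true_syllables, p_w, p_u_given_w = entry
        s = math.log(p_w) + math.log(p_u_given_w)   # log p(W) + log p(U|W)
        for u, v in zip(true_syllables, recognized):
            s += math.log(transition.get((u, v), 1e-12))  # log p(v|u)
        return s
    return max(graph, key=score)[0]   # the candidate syllable text
```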
9. A speech recognition apparatus, comprising:
a syllable transfer module, configured to acquire text data corresponding to a voice signal input by a user and convert the text data into a syllable sequence, the text data being obtained by a voice recognition engine converting the voice signal, and further configured to determine, based on the syllable sequence and a first syllable transition probability distribution T corresponding to the user, syllable transition probabilities corresponding to the syllable sequence, the syllable transition probabilities including a transition probability that each syllable in the syllable sequence corresponds to a real syllable;
and a decoding module, configured to determine a candidate syllable text corresponding to the syllable sequence based on the syllable transition probabilities and a pre-constructed language model, and to output the candidate syllable text.
10. The speech recognition device of claim 9, further comprising:
the system comprises a first probability calculation module, a voice recognition engine, a first syllable transition probability distribution module and a second syllable transition probability distribution module, wherein the first probability calculation module is used for obtaining a real syllable sequence U corresponding to a real voice sample of a user, and a recognition syllable sequence V obtained by recognizing the real voice sample of the user through the voice recognition engine, determining a conditional probability of each syllable in the recognition syllable sequence V based on the real syllable sequence U and the recognition syllable sequence V, and calculating the product of the conditional probabilities of each syllable in the recognition syllable sequence V to obtain a first syllable transition probability distribution T, and the first syllable transition probability distribution T is p (V | U).
11. The speech recognition device of claim 9, further comprising:
the second probability calculation module is used for acquiring a real syllable sequence U corresponding to a real voice sample of a user and an identified syllable sequence V obtained by identifying the real voice sample of the user by a voice identification engine; the true value syllable U in the real syllable sequence U is divided into i And a recognized syllable V in the recognized syllable sequence V i Comparing the real values one by one, and counting any real value syllable u i The first frequency occurring in the real syllable sequence U, and the true value syllable U i Identified syllable v of corresponding position i A second frequency of occurrence; determining the identified syllable v based on the first frequency and the second frequency i Second syllable transition probability p (v) i |u i ) (ii) a Counting and using all of the identified syllables v i Second syllable transition probability p (v) i |u i ) Establishing a first syllable transition probability distribution T corresponding to the user, wherein the first syllable transition probability distribution T is p (V | U).
12. The speech recognition device according to any one of claims 9 to 11, further comprising:
a first updating module for updating the first syllable transition probability distribution T based on a semi-supervised method.
13. The speech recognition device according to any one of claims 9 to 11, further comprising:
the second updating module is used for obtaining feedback information whether the candidate syllable text fed back by the user is consistent with the voice signal; if the feedback information is that the candidate syllable text is consistent with the voice signal, updating the first syllable transition probability distribution T based on the syllable sequence corresponding to the candidate syllable text.
14. The speech recognition device according to any one of claims 9 to 11, further comprising:
the establishing module is used for obtaining a text sequence corresponding to the corpus information, wherein the corpus information comprises scene corpus, or the scene corpus is combined with one or more of corpus and open corpus generated according to a preset rule, the scene corpus is the corpus information of a user in a specific scene, and a corresponding language model is established based on the corpus information.
15. The speech recognition device of any one of claims 9-11, wherein the decoding module is configured to determine a decoding graph of a weighted finite-state machine indicating the pre-constructed language model, determine a candidate syllable text based on the decoding graph and the transition probability of each syllable corresponding to a real syllable, and output the candidate syllable text;
wherein the decoding graph consists of the prior probability p(W) of a text sequence W contained in the pre-constructed language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a real syllable sequence U.
16. A controller, comprising: a memory, and a processor in communication with the memory;
the memory, configured to store program code for speech recognition;
and the processor, configured to invoke the program code for speech recognition in the memory to perform the speech recognition method of any one of claims 1-8.
17. A non-transitory computer-readable storage medium storing a computer program comprising instructions for performing the speech recognition method of any one of claims 1-8.
CN201811639786.6A 2018-12-29 2018-12-29 Voice recognition method, device and controller Active CN111383641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811639786.6A CN111383641B (en) 2018-12-29 2018-12-29 Voice recognition method, device and controller

Publications (2)

Publication Number Publication Date
CN111383641A CN111383641A (en) 2020-07-07
CN111383641B true CN111383641B (en) 2022-10-18

Family

ID=71216744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811639786.6A Active CN111383641B (en) 2018-12-29 2018-12-29 Voice recognition method, device and controller

Country Status (1)

Country Link
CN (1) CN111383641B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185417A (en) * 2020-10-21 2021-01-05 平安科技(深圳)有限公司 Method and device for detecting artificially synthesized voice, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
US9471566B1 (en) * 2005-04-14 2016-10-18 Oracle America, Inc. Method and apparatus for converting phonetic language input to written language output
CN106774975A (en) * 2016-11-30 2017-05-31 百度在线网络技术(北京)有限公司 Input method and device
CN107154260A (en) * 2017-04-11 2017-09-12 北京智能管家科技有限公司 A kind of domain-adaptive audio recognition method and device
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Phonetic entry error correction method and device based on artificial intelligence
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN108364651A (en) * 2017-01-26 2018-08-03 三星电子株式会社 Audio recognition method and equipment


Also Published As

Publication number Publication date
CN111383641A (en) 2020-07-07

Similar Documents

Publication Publication Date Title
CN109785828B (en) Natural language generation based on user speech styles
US10878807B2 (en) System and method for implementing a vocal user interface by combining a speech to text system and a speech to intent system
WO2019149108A1 (en) Identification method and device for voice keywords, computer-readable storage medium, and computer device
US10706852B2 (en) Confidence features for automated speech recognition arbitration
CN110517664B (en) Multi-party identification method, device, equipment and readable storage medium
US11043205B1 (en) Scoring of natural language processing hypotheses
CN111583909B (en) Voice recognition method, device, equipment and storage medium
CN111656366A (en) Method and system for intent detection and slot filling in spoken language dialog systems
CN109273007B (en) Voice wake-up method and device
WO2021190259A1 (en) Slot identification method and electronic device
CN111191016A (en) Multi-turn conversation processing method and device and computing equipment
CN106875936B (en) Voice recognition method and device
US11081104B1 (en) Contextual natural language processing
US11574637B1 (en) Spoken language understanding models
CN113220839B (en) Intention identification method, electronic equipment and computer readable storage medium
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN110827803A (en) Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
CN113470619A (en) Speech recognition method, apparatus, medium, and device
CN111402894A (en) Voice recognition method and electronic equipment
EP3980991B1 (en) System and method for recognizing user's speech
WO2023078370A1 (en) Conversation sentiment analysis method and apparatus, and computer-readable storage medium
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN111883121A (en) Awakening method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant