CN111383641B - Voice recognition method, device and controller - Google Patents

Voice recognition method, device and controller

Info

Publication number
CN111383641B
CN111383641B CN201811639786.6A
Authority
CN
China
Prior art keywords
syllable
sequence
transition probability
probability distribution
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811639786.6A
Other languages
Chinese (zh)
Other versions
CN111383641A (en)
Inventor
黄佑佳
聂为然
于海
翁富良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201811639786.6A priority Critical patent/CN111383641B/en
Publication of CN111383641A publication Critical patent/CN111383641A/en
Application granted granted Critical
Publication of CN111383641B publication Critical patent/CN111383641B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L2013/105 Duration
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/027 Syllables being the recognition units

Abstract

The method establishes in advance a syllable transition probability distribution for each user, determines, based on that distribution, the syllable transition probabilities of the syllable sequence corresponding to a voice signal input by the current user, determines candidate syllable text for the voice signal based on the determined syllable transition probabilities and a pre-built language model, and outputs the candidate syllable text. On the one hand, the syllable transition probability distribution indicates each user's personalized pronunciation characteristics, which improves recognition accuracy during speech recognition. On the other hand, because the syllable transition probability distribution requires little data, it can be embedded in a wide range of devices, enabling personalized speech recognition on the terminal side (for example, a mobile phone or an in-vehicle head unit).

Description

Voice recognition method, device and controller
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, and controller.
Background
With the rapid development of speech recognition technology, personalized speech recognition for specific application scenarios and dialects has emerged alongside Mandarin recognition, and application demand for such specific scenarios in fields such as intelligent terminals and intelligent vehicles keeps growing.
At present, common personalized speech recognition methods fall into two main types. The first performs speech recognition with a general acoustic model, continuously accumulates data during use, manually labels the personalized data, and, once enough data has accumulated, retrains the general acoustic model on the accumulated personalized data so that it adapts to personalized pronunciation characteristics. However, the manual labeling required to collect the personalized data is error-prone; moreover, retraining the general acoustic model is computationally heavy and must be repeated each time enough new data accumulates, which is difficult to achieve on the terminal side. The second uploads local speech and user-related data to the cloud and uses the user's personalized data to select a matching acoustic model from multiple candidate acoustic models in the cloud. On the one hand, this method must classify the user's personalized data over a large amount of data, and factors such as region and speaking rate make the classification inaccurate. On the other hand, it builds candidate acoustic models for each class of personalized data; since there are many such candidate models, they usually have to be hosted in the cloud and are difficult to realize on the terminal side.
In summary, prior-art personalized speech recognition methods are difficult to implement on the terminal side, and recognition in the cloud is prone to errors.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method, apparatus, and controller, so as to improve recognition accuracy while realizing personalized speech recognition on the terminal side, for example on a smartphone or a smart car.
The embodiment of the application provides the following technical scheme:
a first aspect of an embodiment of the present application provides a speech recognition method, including:
acquiring text data corresponding to a voice signal input by a user and converting the text data into a syllable sequence, the text data being obtained by a speech recognition engine converting the voice signal;
determining syllable transition probabilities corresponding to the sequence of syllables based on the sequence of syllables and a first syllable transition probability distribution T corresponding to the user, the syllable transition probabilities including a transition probability that each syllable in the sequence of syllables corresponds to a true syllable;
and determining candidate syllable texts corresponding to the syllable sequences based on the syllable transfer probability and a pre-constructed language model, and outputting the candidate syllable texts.
According to this scheme, a syllable transition probability distribution is established in advance for each user; the syllable transition probabilities of the syllable sequence corresponding to the voice signal input by the current user are determined based on that distribution, and the candidate syllable text corresponding to the voice signal is then determined and output based on the determined syllable transition probabilities and a pre-established language model. On the one hand, the syllable transition probability distribution indicates each user's personalized pronunciation characteristics, improving recognition accuracy during speech recognition. On the other hand, because the syllable transition probability distribution requires little data, it can be embedded in various devices for use, realizing personalized speech recognition on the terminal side.
In one possible design, the pre-establishing of the first syllable transition probability distribution T corresponding to the user comprises:
acquiring a real syllable sequence U corresponding to a real voice sample of the user and a recognized syllable sequence V obtained by a speech recognition engine recognizing the real voice sample of the user;
determining, based on the real syllable sequence U and the recognized syllable sequence V, the conditional probability of each syllable in the recognized syllable sequence V;
and calculating the product of the conditional probabilities of the syllables in the recognized syllable sequence V to obtain the first syllable transition probability distribution T, where the first syllable transition probability distribution T is p(V|U).
In one possible design, the pre-establishing of the first syllable transition probability distribution T corresponding to the user comprises:
acquiring a real syllable sequence U corresponding to a real voice sample of a user, and a recognized syllable sequence V obtained by recognizing the real voice sample of the user by a voice recognition engine;
the true value syllable U in the real syllable sequence U is divided into i And a recognized syllable V in the recognized syllable sequence V i One-to-one correspondence comparisonCounting any true value syllable u i The first frequency occurring in the real syllable sequence U, and the true value syllable U i The recognized syllable v of the corresponding position i A second frequency of occurrence;
determining the identified syllable v based on the first frequency and the second frequency i Second syllable transition probability p (v) i |u i );
Counting and using all said identified syllables v i Second syllable transition probability p (v) i |u i ) Establishing a first syllable transition probability distribution T corresponding to the user, wherein the first syllable transition probability distribution T is p (V | U).
According to this scheme, a corresponding first syllable transition probability distribution T is pre-established for each user; that is, each user has their own first syllable transition probability distribution T. When the user subsequently inputs a voice signal, speech recognition proceeds based on this unique corresponding first syllable transition probability distribution T, which is more accurate.
In one possible design, the speech recognition method further includes:
determining a second syllable transition probability distribution T_m obtained by training on Mandarin corpus samples, a third syllable transition probability distribution T_g obtained by training on accent-group corpus samples, and a fourth syllable transition probability distribution T_p obtained by training on the user's real corpus samples;
obtaining, within a preset time period, the syllable transition probability T_p(x, y) corresponding to any syllable in the fourth syllable transition probability distribution T_p, where x and y indicate the coordinate position of the syllable transition probability T_p(x, y) in the fourth syllable transition probability distribution T_p;
when the syllable transition probability T_p(x, y) is greater than 0, updating the first syllable transition probability distribution T using the syllable transition probability T_p(x, y);
when the syllable transition probability T_p(x, y) is equal to 0, assigning the syllable transition probability T_g(x, y) to the syllable transition probability T_p(x, y) and updating the first syllable transition probability distribution T, where the syllable transition probability T_g(x, y) is the accent recognition probability corresponding to that syllable.
According to this scheme, training on Mandarin corpus samples, accent-group corpus samples, and the user's real corpus samples yields the syllable transition probability distribution corresponding to each. In the subsequent speech recognition process, these several types of syllable transition probability distributions indicate each user's personalized pronunciation characteristics, thereby improving recognition accuracy during speech recognition.
In one possible design, the speech recognition method further includes:
updating the first syllable transition probability distribution T based on a semi-supervised approach.
In one possible design, after the determining and outputting the candidate syllable text corresponding to the first syllable sequence, the method further includes:
obtaining feedback information, fed back by the user, indicating whether the candidate syllable text is consistent with the voice signal;
if the feedback information indicates that the candidate syllable text is consistent with the voice signal, updating the first syllable transition probability distribution T based on the first syllable sequence corresponding to the candidate syllable text.
According to this scheme, the first syllable transition probability distribution T corresponding to the user can be updated based on user feedback, making it more accurate and providing a more accurate model for subsequent speech recognition, thereby improving speech recognition accuracy.
In one possible design, the pre-building of the language model includes:
acquiring a text sequence corresponding to corpus information, where the corpus information comprises scene corpus alone, or scene corpus combined with one or more of corpus generated according to preset rules and open corpus, the scene corpus being corpus information of the user in a specific scene;
and establishing a corresponding language model based on the corpus information.
In one possible design, the determining and outputting the candidate syllable text corresponding to the syllable sequence based on the syllable transition probability and a pre-established language model includes:
determining a decoding graph of a weighted finite state machine indicating the pre-established language model, the decoding graph being composed of the prior probability p(W) of a text sequence W contained in the pre-established language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a true syllable sequence U;
and determining candidate syllable texts based on the decoding graph and, among the syllable transition probabilities, the transition probability of each syllable corresponding to the real syllable, and outputting the candidate syllable texts.
A second aspect of the embodiments of the present application provides a speech recognition apparatus, including:
a syllable transition module, configured to acquire text data corresponding to a voice signal input by a user and convert the text data into a syllable sequence, the text data being obtained by a speech recognition engine converting the voice signal, and to determine syllable transition probabilities corresponding to the syllable sequence based on the syllable sequence and a first syllable transition probability distribution T corresponding to the user, the syllable transition probabilities including, for each syllable in the syllable sequence, the transition probability that it corresponds to a real syllable;
and the decoding module is used for determining candidate syllable texts corresponding to the syllable sequences based on the syllable transition probability and a pre-constructed language model and outputting the candidate syllable texts.
In one possible design, the speech recognition device further includes:
the system comprises a first probability calculation module, a voice recognition engine, a first syllable transition probability distribution module and a second syllable transition probability distribution module, wherein the first probability calculation module is used for obtaining a real syllable sequence U corresponding to a real voice sample of a user, and a recognition syllable sequence V obtained by recognizing the real voice sample of the user through the voice recognition engine, determining a conditional probability of recognizing each syllable in the recognition syllable sequence V based on the real syllable sequence U and the recognition syllable sequence V, and calculating the product of the conditional probabilities of each syllable in the recognition syllable sequence V to obtain a first syllable transition probability distribution T, and the first syllable transition probability distribution T is p (V | U).
In one possible design, the speech recognition device further includes:
the second probability calculation module is used for acquiring a real syllable sequence U corresponding to a real voice sample of a user and a recognition syllable sequence V obtained by recognizing the real voice sample of the user by a voice recognition engine; dividing true value syllable U in the real syllable sequence U i And a recognized syllable V in the recognized syllable sequence V i Comparing the real values one by one, and counting any real value syllable u i The first frequency occurring in the real syllable sequence U, and the true value syllable U i The recognized syllable v of the corresponding position i A second frequency of occurrence; determining the identified syllable v based on the first frequency and the second frequency i Second syllable transition probability p (v) i |u i ) (ii) a Counting and using all said identified syllables v i Second syllable transition probability p (v) i |u i ) Establishing a first syllable transition probability distribution T corresponding to the user, wherein the first syllable transition probability distribution T is p (V | U).
In one possible design, the speech recognition device further includes:
a first updating module for updating the first syllable transition probability distribution T based on a semi-supervised method.
In one possible design, the speech recognition device further includes:
the second updating module is used for acquiring feedback information whether the candidate syllable text fed back by the user is consistent with the voice signal; if the feedback information is that the candidate syllable text is consistent with the voice signal, updating the first syllable transition probability distribution T based on the first syllable sequence corresponding to the candidate syllable text.
In one possible design, the speech recognition device further includes:
the establishing module is used for obtaining a text sequence corresponding to the corpus information, wherein the corpus information comprises scene corpus, or the scene corpus is combined with one or more of corpus and open corpus generated according to a preset rule, the scene corpus is the corpus information of a user in a specific scene, and a corresponding language model is established based on the corpus information.
In one possible design, the decoding module is configured to determine a decoding graph of a weighted finite state machine indicating the pre-established language model, determine candidate syllable texts based on the decoding graph and the transition probability of each syllable corresponding to a true syllable, and output the candidate syllable texts;
wherein the decoding graph is composed of the prior probability p(W) of a text sequence W contained in the pre-established language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a real syllable sequence U.
A third aspect of the embodiments of the present application discloses a controller, including: a memory, and a processor in communication with the memory;
the memory for storing program code for speech recognition;
the processor is configured to call the program code for speech recognition in the memory, and execute the speech recognition method disclosed in the first aspect of the embodiment of the present application.
A fourth aspect of embodiments of the present application provides a non-transitory computer-readable storage medium for storing a computer program including instructions for performing a method in any one of the possible designs of the first aspect of embodiments of the present application.
Drawings
Fig. 1 is a schematic structural diagram of a speech recognition system disclosed in an embodiment of the present application;
fig. 2 is a schematic flow chart of a speech recognition method disclosed in the embodiment of the present application;
FIG. 3 is a diagram illustrating syllable transition probability distributions according to an embodiment of the present application;
FIG. 4 is a diagram of a weighted finite state machine as disclosed in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech recognition apparatus disclosed in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a controller according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings. In the description of the present application, "/" indicates "or"; for example, A/B may denote A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified. Further, to clearly describe the technical solutions of the embodiments of the present application, words such as "first" and "second" are used to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that words such as "first" and "second" do not limit quantity or order of execution, nor do they necessarily indicate that the items are different.
Furthermore, the terms "comprising" and "having" in the description, claims, and drawings of the embodiments of the present application are not exclusive. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to the listed steps or modules, but may include other steps or modules not listed.
Automatic Speech Recognition (ASR) is a core technology in conversational man-machine dialog systems and the basis of current speech recognition; it can be applied in various intelligent terminals, such as mobile phones and vehicle-mounted terminals.
Fig. 1 is a schematic structural diagram of a speech recognition system disclosed in an embodiment of the present application. The speech recognition system 100 mainly includes: an ASR engine 101 and a speech recognition device 102 as disclosed in embodiments of the present application.
The ASR engine 101 is configured to receive a speech signal of a user, convert the speech signal into text data, and send the text data to the speech recognition device 102.
In a particular implementation, the ASR engine 101 converts the received speech signals into corresponding text data based on speech recognition techniques. The ASR engine may be an existing third party speech recognition engine. The speech recognition device 102 disclosed in the embodiment of the present application further performs speech recognition adapted to various environments and various people groups on the basis of text data output by a third-party speech recognition engine, and achieves the purpose of personalized speech recognition at the terminal side.
It should be noted that the voice signal may also be regarded as a voice instruction, i.e., an instruction given by the user that corresponds to an operation to be performed. For example, when the user wants to turn on the air conditioner, a voice instruction containing the speech "turn on the air conditioner" is given.
The speech recognition apparatus 102 is configured to convert the acquired text data into a syllable sequence based on a text database such as a dictionary, and then realize speech recognition using the syllable sequence, syllable transition probability distribution, and a language model established in advance.
The process of performing speech recognition by the speech recognition apparatus based on the text data sent by the ASR engine is described in detail by the following embodiments disclosed in the present application.
As shown in fig. 2, which is a schematic flow chart of a speech recognition method disclosed in the embodiment of the present application, the speech recognition method includes the following steps:
step S201: acquiring text data corresponding to a voice signal input by a user, and converting the text data into a syllable sequence.
Based on the speech recognition system disclosed in fig. 1, the text data obtained by performing step S201 is obtained by converting the received speech signal by the ASR engine based on an automatic speech recognition technique. Step S201 is executed to convert the acquired text data into a syllable sequence based on a text database such as a dictionary.
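For illustration only, the dictionary-based conversion in step S201 can be sketched as follows; the tiny character-to-syllable table and the function name are hypothetical stand-ins, not part of the patent.

```python
# Minimal sketch of step S201, assuming a dictionary that maps each
# character to its syllables. The table below is a hypothetical toy
# stand-in for a full pronunciation dictionary.
SYLLABLE_DICT = {
    "打": ["d", "a"], "开": ["k", "ai"],
    "空": ["k", "ong"], "调": ["t", "iao"],
}

def text_to_syllables(text: str) -> list[str]:
    """Flatten per-character syllable entries into one syllable sequence."""
    syllables: list[str] = []
    for char in text:
        syllables.extend(SYLLABLE_DICT.get(char, ["<unk>"]))
    return syllables

# "Turn on the air conditioner" -> the sequence used in the examples below.
print(text_to_syllables("打开空调"))  # ['d', 'a', 'k', 'ai', 'k', 'ong', 't', 'iao']
```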
Step S202: based on the syllable sequence and the pre-established first syllable transition probability distribution T of the corresponding user, the syllable transition probability corresponding to the syllable sequence is determined.
In the process of implementing step S202, a first syllable transition probability distribution T corresponding to the user is established in advance for each user; that is, each user has their own first syllable transition probability distribution T.
In a specific implementation, the first syllable transition probability distribution T for each user is obtained as follows: a syllable transition probability distribution is obtained by training and calculation on the syllable sequence corresponding to the user's voice data and the real syllables corresponding to that syllable sequence, the syllable transition probabilities including, for each syllable in the syllable sequence, the transition probability that it corresponds to the real syllable.
Alternatively, the first syllable transition probability distribution T corresponding to the user can be obtained in the following manner.
First, a real syllable sequence U corresponding to a real voice sample of the user is acquired, and the ASR engine recognizes the real voice sample of the user to obtain a recognized syllable sequence V, where the real syllable sequence U = u_1 u_2 … u_k and the recognized syllable sequence V = v_1 v_2 … v_k.
Second, the conditional probability of each syllable in the recognized syllable sequence V is determined based on the real syllable sequence U and the recognized syllable sequence V.
Finally, as shown in formula (1), the product of the conditional probabilities of the syllables in the recognized syllable sequence V is calculated to obtain p(V|U). In the embodiments of the present application, p(V|U) indicates the first syllable transition probability distribution T corresponding to the current user.
p(V|U) = p(v_1|u_1) p(v_2|u_2) … p(v_k|u_k)    (1)
Formula (1) is the conditional probability that, given the real syllable sequence U, the ASR engine recognizes the real voice sample as the recognized syllable sequence V. The following takes the real syllable sequence U = "d a k ai k ong t iao" and the recognized syllable sequence V = "d a k e k ong t iao" as an example. From formula (1):
p(V|U) = p(d|d) p(a|a) p(k|k) p(e|ai) p(k|k) p(t|t) p(iao|iao)    (2)
Regarding the syllable transition probability between individual syllables, formula (2) shows that p(e|ai) represents the probability that the user pronounces the true syllable ai as e, or, equivalently, the probability that the recognized syllable e hides a true pronunciation of ai. When p(ai|ai) is close to 1, the user's pronunciation of the syllable ai is very standard; otherwise the pronunciation is not standard, and the syllable ai will be pronounced as other syllables with a certain probability.
In a specific implementation, a syllable transition probability distribution can be represented as a matrix. Fig. 3 shows the syllable transition probability distribution of syllables such as "c o e … b m s". The closer the elements on the diagonal of the matrix are to 1, the more standard the user's pronunciation. For example, Fig. 3 shows p(s|c) = 0.2, indicating that the standard syllable c is pronounced as s with probability 0.2; p(c|c) = 0.8, indicating that the standard syllable c is pronounced as c with probability 0.8; p(b|b) = 0.9, indicating that the standard syllable b is pronounced as b with probability 0.9; and p(m|b) = 0.1, indicating that the standard syllable b is pronounced as m with probability 0.1.
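As a minimal sketch, the Fig. 3 values can be held as a row-stochastic matrix; the storage layout and names below are assumptions, not the patent's implementation.

```python
import numpy as np

# Hypothetical storage of a syllable transition probability distribution:
# rows are true syllables u, columns are recognized syllables v, and
# entry [u][v] holds p(v|u). Values are the ones read off Fig. 3.
syllables = ["c", "s", "b", "m"]
idx = {s: i for i, s in enumerate(syllables)}

T = np.zeros((len(syllables), len(syllables)))
T[idx["c"], idx["c"]] = 0.8  # p(c|c): standard c pronounced as c
T[idx["c"], idx["s"]] = 0.2  # p(s|c): standard c pronounced as s
T[idx["b"], idx["b"]] = 0.9  # p(b|b)
T[idx["b"], idx["m"]] = 0.1  # p(m|b)

def p(v: str, u: str) -> float:
    """Look up p(v|u) in the matrix."""
    return T[idx[u], idx[v]]

assert abs(T[idx["c"]].sum() - 1.0) < 1e-9  # each filled row is a distribution
print(p("s", "c"))  # 0.2
```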
Optionally, the first syllable transition probability distribution T corresponding to the user may also be obtained by calculation in a statistical manner. The method comprises the following specific steps:
First, a real syllable sequence U corresponding to a real voice sample of the user is acquired, together with a recognized syllable sequence V obtained by a speech recognition engine recognizing the real voice sample of the user, where the real syllable sequence U = u_1 u_2 … u_k and the recognized syllable sequence V = v_1 v_2 … v_k.
Second, each ground-truth syllable u_i in the real syllable sequence U is compared one-to-one with the recognized syllable v_i in the recognized syllable sequence V, counting the first frequency count(u_i) with which any ground-truth syllable u_i occurs in the real syllable sequence U and the second frequency count(v_i) with which the recognized syllable v_i at the corresponding position occurs.
Next, formula (3) is used to determine the second syllable transition probability p(v_i|u_i) of the recognized syllable v_i from the first frequency count(u_i) and the second frequency count(v_i):
p(v_i|u_i) = count(v_i) / count(u_i)    (3)
Finally, the second syllable transition probabilities p(v_i|u_i) of all recognized syllables v_i are collected and used to establish the first syllable transition probability distribution T corresponding to the user.
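The counting estimate of formula (3) might look as follows, assuming position-aligned sequences of equal length; the function and variable names are illustrative.

```python
from collections import Counter

def estimate_transition_probs(true_seq: list[str], recog_seq: list[str]) -> dict:
    """Estimate p(v_i|u_i) = count(v_i) / count(u_i) over aligned pairs,
    per formula (3). Assumes the two sequences are position-aligned."""
    assert len(true_seq) == len(recog_seq)
    u_counts = Counter(true_seq)                     # count(u_i)
    pair_counts = Counter(zip(true_seq, recog_seq))  # count of u_i -> v_i pairs
    return {(u, v): c / u_counts[u] for (u, v), c in pair_counts.items()}

U = "d a k ai k ong t iao".split()
V = "d a k e k ong t iao".split()
probs = estimate_transition_probs(U, V)
print(probs[("ai", "e")])  # 1.0: the single 'ai' was recognized as 'e'
```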
The embodiments of the present application are not limited to these two methods of obtaining the first syllable transition probability distribution T. Optionally, a seq2seq deep neural network model may also be used to train directly on the syllable sequences and obtain the syllable transition probability distribution corresponding to a syllable sequence.
Step S203: and determining candidate syllable texts corresponding to the syllable sequence based on the syllable transfer probability and a pre-constructed language model, and outputting the candidate syllable texts.
In step S203, the language model is pre-constructed from existing corpus information, which may comprise scene corpus alone, or scene corpus combined with one or more of corpus generated according to preset rules and open corpus. Note that the scene corpus is corpus information of the user in a specific scene, for example corpus information obtained in an in-vehicle scenario or in an office scenario. The embodiments of the present application do not limit the application scenario of the speech recognition method; any particular scenario may be used, and the scene corpus corresponds to the scenario in which the speech recognition method is applied.
The specific process of constructing the language model is: acquire the text sequence corresponding to the corpus information, train with an n-gram method on the corpus information, and establish the language model corresponding to the corpus information. The language model can be converted to obtain the graph of a weighted finite state machine corresponding to the language model; this graph is associated with syllable sequences.
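A minimal bigram sketch of the n-gram training described above, using maximum-likelihood counts with no smoothing; the toy in-vehicle corpus and names are hypothetical.

```python
from collections import Counter

def train_bigram_lm(corpus: list[list[str]]) -> dict:
    """Maximum-likelihood bigram model: p(w2|w1) = count(w1 w2) / count(w1).
    A real system would add smoothing and higher-order n-grams."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence + ["</s>"]
        unigrams.update(words[:-1])                   # histories w1
        bigrams.update(zip(words[:-1], words[1:]))    # pairs (w1, w2)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Hypothetical in-vehicle scene corpus.
corpus = [["打开", "空调"], ["关闭", "空调"], ["打开", "车窗"]]
lm = train_bigram_lm(corpus)
print(lm[("打开", "空调")])  # 0.5
```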
The language model is used to score and prune the syllable sequence based on the syllable transition probabilities. This is actually realized on the weighted-finite-state-machine graph obtained by converting the language model, and the syllable text corresponding to the syllable sequence with the highest syllable transition probability is finally determined as the candidate syllable text and output.
Note that the output candidate syllable text is the optimal candidate syllable text.
In a specific implementation, a specific implementation process for determining and outputting a candidate syllable text corresponding to a syllable sequence includes the following steps:
First, a decoding graph of a weighted finite state machine indicating the pre-established language model is determined.
The decoding graph is a graph of a weighted finite state machine. It is composed of the prior probability p(W) of a text sequence W contained in the pre-established language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a true syllable sequence U.
The text sequence W is the text sequence actually expressed by the user, W = w_1 w_2 … w_m.
Then, the optimal candidate syllable text sequence is determined and output based on the decoding graph and, among the syllable transition probabilities, the transition probability of each syllable corresponding to the true syllable.
As is apparent from the above description, the decoding graph is obtained by converting the language model, and the decoding graph can constitute a decoder. The corpus information used to form the language model may include scene corpus, corpus generated according to preset rules, and open corpus. Optionally, the prior probability p(W) and the conditional probability p(U|W) of the text sequence W corresponding to the real syllable sequence U are connected by embedding, so as to construct the decoding graph.
Specifically, the decoding graph is composed of multiple paths, each path containing multiple nodes with the structure syllable:text/loss or syllable:<eps>/loss, where syllable is the currently input syllable, text is the output text corresponding to that syllable, <eps> means there is no output text, and loss is the loss incurred by passing along the path.
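One possible in-memory rendering of that arc structure, purely illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Arc:
    """One transition of the decoding graph, mirroring the node structure
    described above: the input syllable, the output text it emits (None
    standing in for <eps>), and the loss of taking this arc. This is an
    illustrative data layout, not the patent's implementation."""
    syllable: str        # currently input syllable
    text: Optional[str]  # output text, or None for <eps>
    loss: float          # loss incurred along this path segment

# Hypothetical competing arcs for input syllable "n": one of the
# "syllable:text/loss" form and one of the "syllable:<eps>/loss" form.
arcs = [Arc("n", "li", 0.51), Arc("n", None, 0.92)]
```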
When determining the optimal candidate syllable text sequence based on the decoding graph and the syllable transition probabilities, the optimal candidate syllable text corresponding to the syllable sequence is searched for on the decoding graph: starting from the first node, for each syllable in the syllable sequence, the path is selected through the nodes whose transition probability of corresponding to the real syllable is larger and whose accumulated path loss is smallest; the text sequence formed by the text of each node on that path is determined as the optimal candidate syllable text and output.
In a specific implementation, the probability of obtaining the text sequence actually expressed by the user based on the syllable sequence obtained from the ASR engine can be expressed by equation (4).
p(W|V) = p(V|U) * p(U|W) * p(W) / p(V)    (4)
The candidate syllable text sequence W* obtained by executing the above steps S201 to S203 can be expressed by formula (5):
W* = argmax_W p(V|U) * p(U|W) * p(W)    (5)
This is illustrated based on the decoding diagram shown in fig. 4.
Assume the voice signal is "call Li", whose corresponding syllable sequence is "h u j iao l i", and that in the first syllable transition probability matrix the probability of l being mispronounced as n is 0.6, the probability of l being pronounced as l is 0.4, and the remaining syllables have no corresponding transition probabilities.
When a user with this pronunciation characteristic wants to say "call Li" but, owing to the pronunciation problem, is recognized by the ASR engine as "call Ni", the recognition result corresponds to the syllable sequence "h u j iao n i". Based on the decoding graph disclosed in Fig. 4, at the start node 0, since h cannot be pronounced as another syllable, the search moves from node 0 to node 2 and then proceeds along nodes 9, 10, 11, and 12. At node 12, since p(n|l) = 0.6 > p(n|n) = 0.4, although the input at this point is n, the probability that the true pronunciation is l is higher; and because the losses of the two paths between node 12 and node 13 are the same (the prior probability p(W) and the conditional probability p(U|W) making up the decoding graph are identical on this portion), the search preferentially moves to node 13 and finally reaches the end node 18, yielding the decoding result "call Li".
By scoring and pruning the syllable sequence with the decoding graph combined with the syllable transition probabilities in this way, the optimal candidate syllable text is finally determined, which improves recognition accuracy during speech recognition.
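The choice made at node 12 can be reduced to the following toy rescoring of formula (5); the lexicon, prior, and probability tables are illustrative stand-ins for what the decoding graph and language model encode.

```python
import math

# Toy rescoring of formula (5): among candidate text sequences W, pick
# the one maximizing p(V|U) * p(U|W) * p(W). All tables are hypothetical.
trans_p = {("l", "n"): 0.6, ("l", "l"): 0.4, ("n", "n"): 0.4,
           ("h", "h"): 1.0, ("u", "u"): 1.0, ("j", "j"): 1.0,
           ("iao", "iao"): 1.0, ("i", "i"): 1.0}          # p(v|u)
lexicon = {"call Li": ["h", "u", "j", "iao", "l", "i"],    # U given W
           "call Ni": ["h", "u", "j", "iao", "n", "i"]}
lm_prior = {"call Li": 0.5, "call Ni": 0.5}                # p(W)

def log_score(W: str, V: list[str]) -> float:
    """log[ p(V|U) * p(U|W) * p(W) ], with p(U|W) = 1 since the lexicon
    maps each candidate W to a single syllable sequence U here."""
    U = lexicon[W]
    s = math.log(lm_prior[W])
    for u, v in zip(U, V):
        s += math.log(trans_p.get((u, v), 1e-9))
    return s

V = ["h", "u", "j", "iao", "n", "i"]   # recognized syllable sequence
best = max(lexicon, key=lambda W: log_score(W, V))
print(best)  # "call Li": p(n|l)=0.6 outweighs p(n|n)=0.4
```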
It should be noted that, in the process of executing step S202, the first syllable transition probability distribution involved can be determined from syllable transition probability distributions obtained by training on different corpora. The training speech data include Mandarin corpus samples, accent-group corpus samples, and the user's historical corpus samples.
Training on the Mandarin corpus samples, the accent-group corpus samples, and the user's historical corpus samples yields the syllable transition probability distribution corresponding to each. As shown in formula (6), the relationship between the first syllable transition probability distribution T disclosed in the embodiments of the present application and the syllable transition probability distributions obtained by training on these different corpora is:
T = w_1 T_m + w_2 T_g + w_3 T_p    (6)
where T_m is the second syllable transition probability distribution obtained by training on the Mandarin corpus samples, T_g is the third syllable transition probability distribution obtained by training on the accent-group speech samples, and T_p is the fourth syllable transition probability distribution obtained by training on the user's real corpus samples; w_1 is the weight corresponding to the second syllable transition probability distribution T_m, w_2 is the weight corresponding to the third syllable transition probability distribution T_g, and w_3 is the weight corresponding to the fourth syllable transition probability distribution T_p, with w_1 + w_2 + w_3 = 1.
Note that the weights corresponding to the second syllable transition probability distribution T_m and the third syllable transition probability distribution T_g may be determined from the accent type and confidence given by an accent recognition classifier; w_2 reflects the degree of the user's accent, and w_3 embodies the user's personalized pronunciation characteristics.
The weight w_1 corresponding to the second syllable transition probability distribution T_m and the weight w_2 corresponding to the third syllable transition probability distribution T_g gradually decrease over time, whereas the weight w_3 corresponding to the fourth syllable transition probability distribution T_p grows with time: the longer the system is used, the higher its value.
The update principle of the weights w_1, w_2, and w_3 is shown in formula (7), which appears only as an image in the original publication; it expresses the weights in terms of the decay function f(t) and the accent recognition classification model G(wav).
Here f(t) is a function that decays with time t; optionally, it may take the form 1/t (t > 1) or the like. G(wav) denotes an accent recognition classification model indicating which accent a given speech sample is recognized as belonging to and with what probability: G(wav)_m represents the probability that the speech data belong to Mandarin, and G(wav)_g represents the probability that the speech data belong to a certain accent group.
Regarding the update of the first syllable transition probability distribution, the recency of the personalized data is relatively important. In a specific implementation, because the collection and statistics of personalized data are affected by time, during the update of the first syllable transition probability distribution T only the fourth syllable transition probability distribution T_p obtained by training on the user's real corpus samples is updated, specifically at the positions where T_p(x, y) > 0. The specific update process is as follows:
First, within a preset time period, the syllable transition probability T_p(x, y) corresponding to any syllable in the fourth syllable transition probability distribution T_p is obtained, where x and y indicate the coordinate position of T_p(x, y) in the fourth syllable transition probability distribution T_p.
Second, when the syllable transition probability T_p(x, y) is greater than 0, the first syllable transition probability distribution T is updated using the syllable transition probability T_p(x, y).
The specific updating process is shown in formula (8):
T(x, y) = w_1 T_m(x, y) + w_2 T_g(x, y) + w_3 T_p(x, y),  T_p(x, y) > 0    (8)
Otherwise, when the syllable transition probability T_p(x, y) is equal to 0, the syllable transition probability T_g(x, y) is assigned to the syllable transition probability T_p(x, y) to update the first syllable transition probability distribution T.
The specific updating process is shown in formula (9):
T(x, y) = w_1 T_m(x, y) + (w_2 + w_3) T_g(x, y),  T_p(x, y) = 0    (9)
where the syllable transition probability T_g(x, y) is the accent recognition probability corresponding to that syllable.
That is, when T_p(x, y) is equal to 0, the update does not use T_p but uses T_g instead. After the calculation and update are completed, the first syllable transition probability distribution T is renormalized, which completes the update of the first syllable transition probability distribution T.
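A sketch of the update in formulas (8) and (9), treating the distributions as row-stochastic matrices and renormalizing at the end; the array shapes, weights, and names are assumptions.

```python
import numpy as np

def update_T(Tm, Tg, Tp, w1, w2, w3):
    """Update the first syllable transition probability distribution T per
    formulas (8) and (9): where the user-trained Tp has evidence (Tp > 0)
    use it; elsewhere fall back on the accent-group distribution Tg."""
    has_user_data = Tp > 0
    T = np.where(has_user_data,
                 w1 * Tm + w2 * Tg + w3 * Tp,        # formula (8)
                 w1 * Tm + (w2 + w3) * Tg)           # formula (9)
    return T / T.sum(axis=1, keepdims=True)          # renormalize each row

rng = np.random.default_rng(0)
def rand_dist(n):  # random row-stochastic matrix for the demo
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

Tm, Tg = rand_dist(4), rand_dist(4)
Tp = rand_dist(4); Tp[2, :] = 0.0                    # no user data for row 2
T = update_T(Tm, Tg, Tp, w1=0.2, w2=0.3, w3=0.5)
assert np.allclose(T.sum(axis=1), 1.0)
```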
It should be noted that the first syllable transition probability distribution T may optionally also be updated based on a semi-supervised method.
Optionally, after the candidate syllable text corresponding to the first syllable sequence is determined and output, feedback information indicating whether the candidate syllable text is consistent with the voice signal is obtained from the user; if the feedback information indicates that they are consistent, the first syllable transition probability distribution T is updated based on the first syllable sequence corresponding to the candidate syllable text.
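One way such a feedback-driven update could be realized, assuming count-based storage behind T; this is entirely illustrative, not the patent's implementation.

```python
from collections import Counter

class FeedbackUpdater:
    """Accumulate confirmed (true syllable, recognized syllable) pairs and
    re-derive p(v|u) from the counts, a hypothetical realization of the
    feedback update described above."""
    def __init__(self):
        self.pair_counts = Counter()
        self.u_counts = Counter()

    def confirm(self, true_seq, recog_seq):
        # Called only when the user confirms the candidate syllable text
        # is consistent with the voice signal.
        for u, v in zip(true_seq, recog_seq):
            self.pair_counts[(u, v)] += 1
            self.u_counts[u] += 1

    def prob(self, v, u):
        return self.pair_counts[(u, v)] / self.u_counts[u] if self.u_counts[u] else 0.0

upd = FeedbackUpdater()
upd.confirm(["h", "u", "j", "iao", "l", "i"], ["h", "u", "j", "iao", "n", "i"])
print(upd.prob("n", "l"))  # 1.0 after a single confirmed utterance
```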
With the speech recognition method provided by the embodiments of the present application, the syllable transition probabilities of the syllable sequence corresponding to the voice signal input by the current user are determined based on the pre-established syllable transition probability distribution of the corresponding user, and the candidate syllable text corresponding to the voice signal is then determined and output using the determined syllable transition probabilities and the pre-established language model. On the one hand, the syllable transition probability distribution indicates each user's personalized pronunciation characteristics, improving recognition accuracy during speech recognition. On the other hand, because the syllable transition probability distribution requires little data, it can be embedded in various devices for use, realizing personalized speech recognition on the terminal side.
Based on the voice recognition method disclosed by the embodiment of the application, the embodiment of the application also correspondingly discloses a voice recognition device for realizing the voice recognition method.
Fig. 5 is a schematic structural diagram of a speech recognition apparatus 500 disclosed in an embodiment of the present application. The speech recognition apparatus 500 includes: a syllable transfer module 501 and a decoding module 502.
A syllable transition module 501, configured to acquire text data corresponding to a voice signal input by a user and convert the text data into a syllable sequence, the text data being obtained by a speech recognition engine converting the voice signal, and to determine syllable transition probabilities corresponding to the syllable sequence based on the syllable sequence and a pre-established first syllable transition probability distribution T corresponding to the user, the syllable transition probabilities including, for each syllable in the syllable sequence, the transition probability that it corresponds to a real syllable.
And the decoding module 502 is used for determining candidate syllable texts corresponding to the syllable sequences based on the syllable transition probability and the pre-constructed language model and outputting the candidate syllable texts.
The decoding module 502 is specifically configured to determine a decoding graph of a weighted finite state machine indicating the pre-established language model, determine candidate syllable texts based on the decoding graph and the transition probabilities of each syllable corresponding to the true syllables in the syllable transition probabilities, and output the candidate syllable texts.
Wherein the decoding graph consists of the prior probability p(W) of a text sequence W contained in the pre-established language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a real syllable sequence U.
Optionally, the speech recognition apparatus 500 further includes a first probability calculation module.
The first probability calculation module is configured to acquire a real syllable sequence U corresponding to a real voice sample of the user and a recognized syllable sequence V obtained by a speech recognition engine recognizing the real voice sample of the user, determine the conditional probability of each syllable in the recognized syllable sequence V based on the real syllable sequence U and the recognized syllable sequence V, and calculate the product of the conditional probabilities of the syllables in the recognized syllable sequence V to obtain the first syllable transition probability distribution T, where the first syllable transition probability distribution T is p(V|U).
Optionally, the speech recognition apparatus 500 further includes a second probability calculation module.
The second probability calculation module is configured to acquire a real syllable sequence U corresponding to a real voice sample of the user and a recognized syllable sequence V obtained by a speech recognition engine recognizing the real voice sample of the user; compare each ground-truth syllable u_i in the real syllable sequence U one-to-one with the recognized syllable v_i in the recognized syllable sequence V, counting the first frequency with which any ground-truth syllable u_i occurs in the real syllable sequence U and the second frequency with which the recognized syllable v_i at the corresponding position occurs; determine the second syllable transition probability p(v_i|u_i) of the recognized syllable v_i based on the first frequency and the second frequency; and collect the second syllable transition probabilities p(v_i|u_i) of all recognized syllables v_i to establish the first syllable transition probability distribution T corresponding to the user, where the first syllable transition probability distribution T is p(V|U).
Optionally, the speech recognition apparatus 500 further includes a first updating module.
And the first updating module is used for updating the first syllable transfer probability distribution T based on a semi-supervised method.
Optionally, the speech recognition apparatus 500 further includes a second updating module.
The second updating module is configured to acquire feedback information, fed back by the user, indicating whether the candidate syllable text is consistent with the voice signal; if the feedback information indicates that the candidate syllable text is consistent with the voice signal, the first syllable transition probability distribution T is updated based on the first syllable sequence corresponding to the candidate syllable text.
Optionally, the speech recognition apparatus 500 further includes a setup module.
And the establishing module is used for acquiring the text sequence corresponding to the corpus information and establishing a corresponding language model based on the text sequence corresponding to the corpus information. The corpus information includes scene corpus, or the scene corpus is combined with one or more of corpus and open corpus generated according to preset rules. The scene corpus is the corpus information of the user in a specific scene.
The execution principle and process of each module in the speech recognition apparatus disclosed in the embodiment of the present application are the same as the execution principle and process corresponding to the corresponding step in the speech recognition method disclosed in the embodiment of the present application, and reference may be made to the corresponding part in the speech recognition method disclosed in the embodiment of the present application, which is not described herein again.
The speech recognition method disclosed in Fig. 2 and the speech recognition apparatus disclosed in Fig. 5 of the embodiments of the present application, i.e., the speech recognition device in Fig. 1, can also be implemented directly by a controller: in hardware, as a memory whose program is executed by a processor, or as a combination of the two. Fig. 6 is a schematic structural diagram of a controller for implementing the speech recognition method disclosed in the embodiments of the present application; the controller may also be arranged in a main control system or processing system on the terminal side. The controller 600 includes: a memory 601, and a processor 602 and a communication interface 603 in communication with the memory.
The processor 602 is coupled to the memory 601 through a bus. The processor 602 is coupled to the communication interface 603 via a bus.
The processor 602 may be a Central Processing Unit (CPU), a Network Processor (NP), an application-specific integrated circuit (ASIC), or a Programmable Logic Device (PLD). The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), or a General Array Logic (GAL).
The memory 601 may specifically be a content-addressable memory (CAM) or a random-access memory (RAM). The CAM may be a Ternary Content Addressable Memory (TCAM).
The communication interface 603 may be a wired interface, such as a Fiber Distributed Data Interface (FDDI) or ethernet interface.
The memory 601 may also be integrated in the processor 602. If the memory 601 and the processor 602 are separate devices, they are connected and can communicate, for example, via a bus. The communication interface 603 and the processor 602 may likewise communicate via a bus, or the communication interface 603 may be connected directly to the processor 602.
A memory 601 for storing program code for speech recognition. Optionally, the memory 601 includes an operating system and an application program, which are used to carry the operating program, codes or instructions of the speech recognition method disclosed in the embodiment of the present application.
When the processor 602 or a hardware device is to perform the operations related to the speech recognition method disclosed in the embodiments of the present application, the processor calls and executes the operating programs, code, or instructions stored in the memory 601 to complete the processes of the speech recognition method described in the embodiments of the present application. Specifically, the processor 602 calls the program code for speech recognition in the memory 601 and executes the speech recognition method.
Processor 602 implements the steps in the method embodiments by calling a program in memory 601. The processor 602 may also be a physical device that embodies the speech recognition apparatus 500.
It is to be understood that the operations of receiving/transmitting and the like involved in the above-mentioned voice recognition apparatus and voice recognition method embodiment shown in fig. 5 may be receiving/transmitting processing implemented by a processor, or may be transmitting/receiving processes implemented by a receiver and a transmitter, and the receiver and the transmitter may exist independently or may be integrated into a transceiver.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), among others.
In summary, with the voice recognition method and apparatus provided in the embodiments of the present application, the syllable transition probabilities of the syllable sequence corresponding to the voice signal input by the current user are determined based on the pre-established syllable transition probability distribution of that user, and the candidate syllable text corresponding to the voice signal is then determined and output using the determined syllable transition probabilities and a pre-constructed language model. On the one hand, the syllable transition probability distribution captures the personalized pronunciation characteristics of each user, which improves recognition accuracy during speech recognition. On the other hand, because the syllable transition probability distribution requires only a small amount of data, it can be embedded in a variety of devices, enabling personalized speech recognition on the terminal side.
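By way of illustration only, the following Python sketch shows one way the flow just summarized could be wired together on a terminal device. The pypinyin converter, the 1e-12 probability floor, and all names (recognize, transition, lm_prior) are assumptions introduced here, not part of the disclosed implementation:

```python
import math
from pypinyin import lazy_pinyin  # open-source Chinese text-to-pinyin converter

def recognize(engine_text, candidate_texts, transition, lm_prior):
    """Rescore candidate texts with the user's syllable transition
    distribution T (dict {(u, v): p(v|u)}) and a language-model prior
    p(W) (dict {text: probability}); return the best candidate."""
    recognized = lazy_pinyin(engine_text)            # recognized syllables V
    best_text, best_score = None, float("-inf")
    for text in candidate_texts:
        true_syllables = lazy_pinyin(text)           # hypothesized syllables U
        score = math.log(lm_prior.get(text, 1e-12))  # log p(W)
        for u, v in zip(true_syllables, recognized):
            score += math.log(transition.get((u, v), 1e-12))  # log p(v|u)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```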
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application and the benefits derived therefrom have been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions and of the present claims.

Claims (17)

1. A speech recognition method, comprising:
acquiring text data corresponding to a voice signal input by a user and converting the text data into a syllable sequence, the text data being obtained by a voice recognition engine converting the voice signal;
determining syllable transition probabilities corresponding to the syllable sequence based on the syllable sequence and a first syllable transition probability distribution T corresponding to the user, the syllable transition probabilities including a transition probability that each syllable in the syllable sequence corresponds to a real syllable;
and determining a candidate syllable text corresponding to the syllable sequence based on the syllable transition probabilities and a pre-constructed language model, and outputting the candidate syllable text.
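As a non-normative aside, the text-to-syllable conversion recited in claim 1 could be performed with, for example, the open-source pypinyin package; the claim itself does not prescribe any particular converter:

```python
from pypinyin import lazy_pinyin

engine_text = "打开空调"                      # text data from the recognition engine
syllable_sequence = lazy_pinyin(engine_text)  # convert text to a syllable sequence
print(syllable_sequence)                      # ['da', 'kai', 'kong', 'tiao']
```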
2. The method of claim 1, wherein the procedure for pre-establishing the first syllable transition probability distribution T corresponding to the user comprises:
acquiring a real syllable sequence U corresponding to a real voice sample of the user, and a recognized syllable sequence V obtained by a voice recognition engine recognizing the real voice sample;
determining a conditional probability for each syllable in the recognized syllable sequence V based on the real syllable sequence U and the recognized syllable sequence V;
and calculating the product of the conditional probabilities of the syllables in the recognized syllable sequence V to obtain the first syllable transition probability distribution T, where the first syllable transition probability distribution T is p(V|U).
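A minimal sketch of the claim-2 computation, assuming the per-syllable conditional probabilities are already available as a dictionary (the names p_V_given_U and cond_prob are hypothetical):

```python
def p_V_given_U(true_seq, recognized_seq, cond_prob):
    """Product of the per-syllable conditional probabilities, yielding the
    first syllable transition probability distribution value p(V|U).
    cond_prob maps (u_i, v_i) -> p(v_i|u_i); unseen pairs default to 0."""
    prob = 1.0
    for u, v in zip(true_seq, recognized_seq):
        prob *= cond_prob.get((u, v), 0.0)
    return prob
```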
3. The method of claim 1, wherein the procedure for pre-establishing the first syllable transition probability distribution T corresponding to the user comprises:
acquiring a real syllable sequence U corresponding to a real voice sample of the user, and a recognized syllable sequence V obtained by a voice recognition engine recognizing the real voice sample;
comparing each ground-truth syllable u_i in the real syllable sequence U with the recognized syllable v_i at the corresponding position in the recognized syllable sequence V one by one, and counting a first frequency with which any ground-truth syllable u_i occurs in the real syllable sequence U and a second frequency with which the recognized syllable v_i occurs at the position corresponding to the ground-truth syllable u_i;
determining a second syllable transition probability p(v_i|u_i) of the recognized syllable v_i based on the first frequency and the second frequency;
and collecting the second syllable transition probabilities p(v_i|u_i) of all the recognized syllables v_i and using them to establish the first syllable transition probability distribution T corresponding to the user, where the first syllable transition probability distribution T is p(V|U).
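The frequency-count estimation of claim 3 admits a short sketch; the function name and the unsmoothed relative-frequency form are assumptions:

```python
from collections import Counter

def estimate_T(true_seq, recognized_seq):
    """Relative-frequency estimate of p(v_i|u_i): the first frequency counts
    occurrences of each ground-truth syllable u_i, the second counts the
    recognized syllable v_i at u_i's positions."""
    first = Counter(true_seq)                        # first frequency per u_i
    second = Counter(zip(true_seq, recognized_seq))  # second frequency per (u_i, v_i)
    return {(u, v): n / first[u] for (u, v), n in second.items()}

# e.g. a speaker who fronts retroflex initials:
# estimate_T(['zhi', 'shi', 'zhi'], ['zi', 'si', 'zhi'])
# -> {('zhi', 'zi'): 0.5, ('shi', 'si'): 1.0, ('zhi', 'zhi'): 0.5}
```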
4. The method according to any one of claims 1-3, further comprising:
determining a second syllable transition probability distribution T_m obtained by training on Mandarin corpus samples, a third syllable transition probability distribution T_g obtained by training on accent-group corpus samples, and a fourth syllable transition probability distribution T_p obtained by training on real corpus samples of the user;
obtaining, within a preset time period, the syllable transition probability T_p(x, y) corresponding to any syllable in the fourth syllable transition probability distribution T_p, where x and y indicate the coordinate position of the syllable transition probability T_p(x, y) in the fourth syllable transition probability distribution T_p;
when the syllable transition probability T_p(x, y) is greater than 0, updating the first syllable transition probability distribution T with the syllable transition probability T_p(x, y);
and when the syllable transition probability T_p(x, y) is less than 0, assigning the value of the syllable transition probability T_g(x, y) to the syllable transition probability T_p(x, y) and updating the first syllable transition probability distribution T accordingly, the syllable transition probability T_g(x, y) being the accent recognition probability corresponding to the syllable.
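Claim 4's update rule can be sketched over matrix-valued distributions. The negative-sentinel initialization for unobserved entries of T_p is an assumption (the claim only distinguishes T_p(x, y) > 0 from T_p(x, y) < 0), as are all names below:

```python
import numpy as np

def update_T(T, T_g, T_p):
    """Adopt the user's own statistics where observed (T_p > 0) and fall
    back to the accent-group distribution T_g where T_p still carries a
    negative sentinel; all arguments are arrays indexed by (x, y)."""
    T = T.copy()
    observed = T_p > 0
    T[observed] = T_p[observed]   # trust the user's personal statistics
    missing = T_p < 0
    T[missing] = T_g[missing]     # fall back to the accent group
    return T
```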
5. The method according to any one of claims 1-3, further comprising:
updating the first syllable transition probability distribution T based on a semi-supervised method.
6. The method according to any one of claims 1-3, wherein, after the determining and outputting of the candidate syllable text corresponding to the syllable sequence, the method further comprises:
obtaining feedback information from the user indicating whether the candidate syllable text is consistent with the voice signal;
and if the feedback information indicates that the candidate syllable text is consistent with the voice signal, updating the first syllable transition probability distribution T based on the syllable sequence corresponding to the candidate syllable text.
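A sketch of the claim-6 feedback update, assuming the claim-3 frequency counters are kept between utterances; the function name and both Counter arguments are hypothetical:

```python
def apply_feedback(first, second, confirmed_syllables, recognized_syllables,
                   consistent):
    """Fold the confirmed candidate's syllable sequence back into the
    first/second frequency counters (collections.Counter) used to build T,
    but only when the user confirms the candidate text is consistent."""
    if consistent:
        first.update(confirmed_syllables)
        second.update(zip(confirmed_syllables, recognized_syllables))
```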
7. The method according to any one of claims 1-3, wherein the pre-construction of the language model comprises:
acquiring a text sequence corresponding to corpus information, wherein the corpus information comprises a scene corpus, or the scene corpus combined with one or more of a corpus generated according to a preset rule and an open corpus, the scene corpus being corpus information of the user in a specific scene;
and establishing the corresponding language model based on the corpus information.
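Claim 7 leaves the language-model family open; a toy bigram model over the corpus text sequences illustrates the construction (all names here are hypothetical):

```python
from collections import Counter

def build_bigram_lm(text_sequences):
    """Estimate p(token_2 | token_1) by relative frequency over the corpus;
    each sequence is padded with sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for seq in text_sequences:
        tokens = ["<s>"] + list(seq) + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that starts a bigram
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return {bg: n / unigrams[bg[0]] for bg, n in bigrams.items()}
```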
8. The method according to any one of claims 1-3, wherein the determining and outputting of the candidate syllable text corresponding to the syllable sequence based on the syllable transition probabilities and the pre-constructed language model comprises:
determining a decoding graph of a weighted finite-state machine indicating the pre-constructed language model, the decoding graph being composed of the prior probability p(W) of a text sequence W contained in the pre-constructed language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a real syllable sequence U;
and determining the candidate syllable text based on the decoding graph and the transition probability of each syllable corresponding to a real syllable among the syllable transition probabilities, and outputting the candidate syllable text.
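The decoding of claim 8 composes three scores: the prior p(W), the conditional p(U|W), and the per-syllable transition probabilities p(v|u). The sketch below flattens the weighted finite-state decoding graph into an explicit list of hypotheses, which a real implementation would replace with transducer composition and search; the list layout is an assumption:

```python
import math

def decode(recognized, graph, transition):
    """graph: list of (text W, true syllables U, p(W), p(U|W)) tuples,
    a hypothetical stand-in for the weighted finite-state decoding graph."""
    def score(entry):
        text, true_syllables, p_w, p_u_given_w = entry
        s = math.log(p_w) + math.log(p_u_given_w)   # log p(W) + log p(U|W)
        for u, v in zip(true_syllables, recognized):
            s += math.log(transition.get((u, v), 1e-12))  # log p(v|u)
        return s
    return max(graph, key=score)[0]   # the candidate syllable text
```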
9. A speech recognition apparatus, comprising:
a syllable transfer module, configured to acquire text data corresponding to a voice signal input by a user and convert the text data into a syllable sequence, the text data being obtained by a voice recognition engine converting the voice signal, and further configured to determine, based on the syllable sequence and a first syllable transition probability distribution T corresponding to the user, syllable transition probabilities corresponding to the syllable sequence, the syllable transition probabilities including a transition probability that each syllable in the syllable sequence corresponds to a real syllable;
and a decoding module, configured to determine a candidate syllable text corresponding to the syllable sequence based on the syllable transition probabilities and a pre-constructed language model, and to output the candidate syllable text.
10. The speech recognition device of claim 9, further comprising:
the system comprises a first probability calculation module, a voice recognition engine, a first syllable transition probability distribution module and a second syllable transition probability distribution module, wherein the first probability calculation module is used for obtaining a real syllable sequence U corresponding to a real voice sample of a user, and a recognition syllable sequence V obtained by recognizing the real voice sample of the user through the voice recognition engine, determining a conditional probability of each syllable in the recognition syllable sequence V based on the real syllable sequence U and the recognition syllable sequence V, and calculating the product of the conditional probabilities of each syllable in the recognition syllable sequence V to obtain a first syllable transition probability distribution T, and the first syllable transition probability distribution T is p (V | U).
11. The speech recognition device of claim 9, further comprising:
the second probability calculation module is used for acquiring a real syllable sequence U corresponding to a real voice sample of a user and an identified syllable sequence V obtained by identifying the real voice sample of the user by a voice identification engine; the true value syllable U in the real syllable sequence U is divided into i And a recognized syllable V in the recognized syllable sequence V i Comparing the real values one by one, and counting any real value syllable u i The first frequency occurring in the real syllable sequence U, and the true value syllable U i Identified syllable v of corresponding position i A second frequency of occurrence; determining the identified syllable v based on the first frequency and the second frequency i Second syllable transition probability p (v) i |u i ) (ii) a Counting and using all of the identified syllables v i Second syllable transition probability p (v) i |u i ) Establishing a first syllable transition probability distribution T corresponding to the user, wherein the first syllable transition probability distribution T is p (V | U).
12. The speech recognition device according to any one of claims 9 to 11, further comprising:
a first updating module for updating the first syllable transition probability distribution T based on a semi-supervised method.
13. The speech recognition device according to any one of claims 9 to 11, further comprising:
the second updating module is used for obtaining feedback information whether the candidate syllable text fed back by the user is consistent with the voice signal; if the feedback information is that the candidate syllable text is consistent with the voice signal, updating the first syllable transition probability distribution T based on the syllable sequence corresponding to the candidate syllable text.
14. The speech recognition device according to any one of claims 9 to 11, further comprising:
the establishing module is used for obtaining a text sequence corresponding to the corpus information, wherein the corpus information comprises scene corpus, or the scene corpus is combined with one or more of corpus and open corpus generated according to a preset rule, the scene corpus is the corpus information of a user in a specific scene, and a corresponding language model is established based on the corpus information.
15. The speech recognition device of any one of claims 9-11, wherein the decoding module is configured to determine a decoding graph of a weighted finite-state machine indicating the pre-constructed language model, determine a candidate syllable text based on the decoding graph and the transition probability of each syllable corresponding to a real syllable, and output the candidate syllable text;
wherein the decoding graph consists of the prior probability p(W) of a text sequence W contained in the pre-constructed language model and the conditional probability p(U|W) that, given the text sequence W, the text sequence corresponds to a real syllable sequence U.
16. A controller, comprising: a memory, and a processor in communication with the memory;
the memory, configured to store program code for speech recognition;
and the processor, configured to invoke the program code for speech recognition in the memory to perform the speech recognition method of any one of claims 1-8.
17. A non-transitory computer-readable storage medium storing a computer program comprising instructions for performing the speech recognition method of any one of claims 1-8.
CN201811639786.6A 2018-12-29 2018-12-29 Voice recognition method, device and controller Active CN111383641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811639786.6A CN111383641B (en) 2018-12-29 2018-12-29 Voice recognition method, device and controller

Publications (2)

Publication Number Publication Date
CN111383641A CN111383641A (en) 2020-07-07
CN111383641B true CN111383641B (en) 2022-10-18

Family

ID=71216744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811639786.6A Active CN111383641B (en) 2018-12-29 2018-12-29 Voice recognition method, device and controller

Country Status (1)

Country Link
CN (1) CN111383641B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185417A (en) * 2020-10-21 2021-01-05 平安科技(深圳)有限公司 Method and device for detecting artificially synthesized voice, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
US9471566B1 (en) * 2005-04-14 2016-10-18 Oracle America, Inc. Method and apparatus for converting phonetic language input to written language output
CN106774975A (en) * 2016-11-30 2017-05-31 百度在线网络技术(北京)有限公司 Input method and device
CN107154260A (en) * 2017-04-11 2017-09-12 北京智能管家科技有限公司 A kind of domain-adaptive audio recognition method and device
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Phonetic entry error correction method and device based on artificial intelligence
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN108364651A (en) * 2017-01-26 2018-08-03 三星电子株式会社 Audio recognition method and equipment


Also Published As

Publication number Publication date
CN111383641A (en) 2020-07-07

Similar Documents

Publication Publication Date Title
CN109785828B (en) Natural language generation based on user speech styles
US10878807B2 (en) System and method for implementing a vocal user interface by combining a speech to text system and a speech to intent system
WO2019149108A1 (en) Identification method and device for voice keywords, computer-readable storage medium, and computer device
US10706852B2 (en) Confidence features for automated speech recognition arbitration
CN110517664B (en) Multi-party identification method, device, equipment and readable storage medium
US11043205B1 (en) Scoring of natural language processing hypotheses
CN111583909B (en) Voice recognition method, device, equipment and storage medium
CN111656366A (en) Method and system for intent detection and slot filling in spoken language dialog systems
CN109273007B (en) Voice wake-up method and device
WO2021190259A1 (en) Slot identification method and electronic device
CN111191016A (en) Multi-turn conversation processing method and device and computing equipment
CN106875936B (en) Voice recognition method and device
US11081104B1 (en) Contextual natural language processing
US11574637B1 (en) Spoken language understanding models
CN113220839B (en) Intention identification method, electronic equipment and computer readable storage medium
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN110827803A (en) Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
CN113470619A (en) Speech recognition method, apparatus, medium, and device
CN111402894A (en) Voice recognition method and electronic equipment
EP3980991B1 (en) System and method for recognizing user's speech
WO2023078370A1 (en) Conversation sentiment analysis method and apparatus, and computer-readable storage medium
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN111883121A (en) Awakening method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant