WO2020001458A1 - Speech recognition method, device, and system - Google Patents

Speech recognition method, device, and system

Info

Publication number
WO2020001458A1
Authority
WO
WIPO (PCT)
Prior art keywords
wfst
acoustic
path
pronunciation
probability
Prior art date
Application number
PCT/CN2019/092935
Other languages
English (en)
French (fr)
Inventor
杨占磊
肖龙帅
黄茂胜
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2020001458A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 — Hidden Markov Models [HMMs]
    • G10L15/144 — Training of HMMs
    • G10L15/16 — Speech classification or search using artificial neural networks

Definitions

  • the present invention relates to the field of computer technology, and particularly to the field of speech recognition technology.
  • Speech recognition refers to a technology that recognizes the corresponding text content from the speech waveform, and is one of the important technologies in the field of artificial intelligence.
  • The decoder is one of the core modules of speech recognition technology. It builds a recognition network from the trained acoustic model, language model, and pronunciation dictionary; each path in the network corresponds to a piece of text information and its pronunciation. For the pronunciation output by the acoustic model, the decoder searches the recognition network for the highest-probability path, and the text information corresponding to that path is output as the most probable text for the speech signal, completing the speech recognition.
  • a method of speech recognition has been proposed in the prior art.
  • the solution combines guidance from Mandarin Chinese phonetic knowledge with a training-data-driven approach. By building a decision tree, the acoustic model can share parameters at the state level and capture contextual correlation; the acoustic model is built at the level of vowels and finals.
  • This technical solution designs a set of phonetic questions for the decision-tree construction algorithm, which can capture distinguishing features of Mandarin Chinese speech, such as voiceless versus voiced and nasal versus non-nasal sounds (here, such pronunciation features are one kind of pronunciation attribute).
  • the decision tree reduces blind model matching, improves search efficiency and accuracy, and resolves the conflict between model accuracy and trainability.
  • This solution clusters the acoustic model states using pronunciation attribute information, allowing richer acoustic models to be applied and improving system performance.
  • However, the construction of the acoustic model is still based on the statistical characteristics of the speech features.
  • When applied to medium- and long-distance interaction scenarios, the speech is disturbed by noise and reverberation from the surrounding environment, so the statistical characteristics of the speech features change; this causes the performance of the acoustic model to drop sharply and the accuracy of speech recognition to be low.
  • the embodiments of the present application provide a speech recognition decoding method, system, and device, and a corresponding method, system, and device for constructing a speech recognition weighted finite-state transducer (WFST).
  • An embodiment of the present application provides, on the one hand, a method for constructing a speech recognition WFST.
  • the method includes: constructing an acoustic WFST (H1), where the acoustic WFST is a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST (A), where the pronunciation WFST is a search network from pronunciation attributes to phonemes; constructing a dictionary WFST (L), where the dictionary WFST is a search network from phonemes to characters or words; constructing a language WFST (G), where the language WFST is a search network from characters or words to word sequences; and integrating multiple WFSTs to generate the speech recognition WFST, where the multiple WFSTs include the acoustic WFST, the pronunciation WFST, the dictionary WFST, and the language WFST;
  • the integrated speech recognition WFST is a search network from acoustic features to word sequences, expressed as H1 * A * L * G.
  • the acoustic weighted finite-state transducer WFST (H1) is constructed as follows: with pronunciation attributes as the states and acoustic features as the observations, a hidden Markov model (HMM) is combined with the forward-backward algorithm, the expectation-maximization algorithm, and the Viterbi algorithm to obtain the probability of generating a given acoustic feature conditioned on a pronunciation attribute, and the acoustic WFST is constructed from these probabilities.
  • the pronunciation WFST (A) is constructed as follows: a deep neural network takes acoustic features as input and uses phonemes and pronunciation attributes as dual output targets, and the phoneme and pronunciation attribute with the highest probabilities are taken as a co-occurrence of a pronunciation attribute and a phoneme. The number of co-occurrences of each pronunciation attribute and phoneme is counted over the inputs and outputs of a large speech corpus and divided by the total number of frames to obtain the co-occurrence probability of the pronunciation attribute and the phoneme. These co-occurrence probabilities are expressed as the pronunciation WFST: the input of a pronunciation WFST state transition is a pronunciation attribute, and the output is a phoneme together with the co-occurrence probability of the pronunciation attribute and the phoneme.
  • a second acoustic WFST (H2) is constructed.
  • the second acoustic WFST is a search network from acoustic features to phonemes; the integrated multiple WFSTs further include the second acoustic WFST.
  • the integrated weighted finite-state transducer is (H1 * A + H2) * L * G.
  • the integration step is: the acoustic WFST and the pronunciation WFST are integrated to obtain a WFST from acoustic features to phonemes, which is then network-merged with the second acoustic WFST to generate a combined WFST from acoustic features to phonemes; the dictionary WFST is then integrated with the language WFST, and the resulting transducer is integrated with the above network-merged WFST from acoustic features to phonemes to generate the speech recognition WFST.
  • network merging means merging the identical paths of two WFSTs that have the same input and output types, combining their probabilities, and keeping the differing paths.
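  • As a rough illustrative sketch of this network merge (the patent specifies no data layout, so the dictionary-based arc representation, the additive combination of probabilities, and all labels below are assumptions):

```python
# Illustrative sketch of the "network merge" described above. A WFST is
# assumed to be stored as {(src, dst, input_label, output_label): probability};
# this layout and all labels are invented for illustration.

def network_merge(wfst_a, wfst_b):
    """Merge two WFSTs of the same input/output type: arcs that agree on
    states and labels are combined by summing their probabilities; arcs
    present in only one WFST are kept unchanged."""
    merged = dict(wfst_a)
    for arc, prob in wfst_b.items():
        if arc in merged:
            merged[arc] += prob   # same path: combine the probabilities
        else:
            merged[arc] = prob    # differing path: keep it as-is
    return merged

# Example: H1∘A and H2 both map acoustic features to phonemes.
h1_a = {(0, 1, "feat3", "p_a"): 0.6, (0, 2, "feat7", "p_b"): 0.4}
h2   = {(0, 1, "feat3", "p_a"): 0.3, (0, 3, "feat9", "p_c"): 0.7}
print(network_merge(h1_a, h2))
```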
  • the integration process further includes performing determinization and minimization.
  • An embodiment of the present application provides, on the one hand, a method for constructing a speech recognition WFST, the method including: constructing an acoustic weighted finite-state transducer WFST (H1), where the acoustic WFST is a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST (A), which is a search network from pronunciation attributes to context-dependent phonemes; constructing a context WFST (C), which is a search network from context-dependent phonemes to phonemes; constructing a dictionary WFST (L), where the dictionary WFST is a search network from phonemes to characters or words; constructing a language WFST (G), where the language WFST is a search network from characters or words to word sequences; and integrating multiple WFSTs to generate the speech recognition WFST;
  • the multiple WFSTs include: the acoustic WFST, the pronunciation WFST, the context WFST, the dictionary WFST, and the language WFST; the integrated speech recognition WFST is a search network from acoustic features to word sequences.
  • the integration step is specifically: the dictionary WFST is integrated with the language WFST; the resulting transducer is then integrated with the context WFST; the result is then integrated with the pronunciation WFST; and the result is further integrated with the acoustic WFST.
  • the acoustic weighted finite-state transducer WFST (H1) is constructed as follows: with pronunciation attributes as the states and acoustic features as the observation sequence, an HMM is combined with the forward-backward algorithm, the expectation-maximization algorithm, and the Viterbi algorithm to obtain the probability of generating a given observation (acoustic feature) conditioned on a pronunciation attribute, and the acoustic WFST is constructed from these probabilities.
  • the pronunciation WFST (A) is constructed as follows: a deep neural network takes acoustic features as input and uses phonemes and pronunciation attributes as dual output targets, and the phoneme and pronunciation attribute with the highest probabilities are taken as a co-occurrence of a pronunciation attribute and a phoneme. The number of co-occurrences of each pronunciation attribute and phoneme is counted over the inputs and outputs of a large speech corpus and divided by the total number of frames to obtain the co-occurrence probability of the pronunciation attribute and the phoneme. These co-occurrence probabilities are expressed as the pronunciation WFST: the input of a pronunciation WFST state transition is a pronunciation attribute, and the output is a phoneme together with the co-occurrence probability of the pronunciation attribute and the phoneme.
  • the method further includes: constructing a second acoustic WFST, where the second acoustic WFST is a search network from acoustic features to context-dependent phonemes.
  • the integrated weighted finite-state transducer is (H1 * A + H2) * C * L * G.
  • the integration step is: the acoustic WFST and the pronunciation WFST are integrated to obtain a WFST from acoustic features to context-dependent phonemes, which is network-merged with the second acoustic WFST to generate a combined WFST from acoustic features to context-dependent phonemes; the dictionary WFST is then integrated with the language WFST, the resulting transducer is integrated with the context WFST, and that result is then integrated with the above network-merged WFST from acoustic features to context-dependent phonemes to generate the speech recognition WFST.
  • network merging means merging the identical paths of two WFSTs that have the same input and output types, combining their probabilities, and keeping the differing paths.
  • the integration further includes performing determinization and minimization.
  • An embodiment of the present application further provides a speech recognition decoding method, the method including: receiving a speech signal; extracting acoustic features from the speech signal; inputting the acoustic features into a speech recognition WFST to obtain the probability of each path from acoustic features to word sequences; and comparing the probabilities of the paths and outputting the word sequence corresponding to the path with the highest probability as the recognition result.
  • the speech recognition WFST is a search network from acoustic features to word sequences generated by integrating acoustic WFST, pronunciation WFST, context WFST, dictionary WFST, and language WFST.
  • the acoustic WFST is a search network from acoustic features to pronunciation attributes;
  • the pronunciation WFST is a search network from pronunciation attributes to context-dependent phonemes;
  • the context WFST is a search network from context-dependent phonemes to phonemes;
  • the dictionary WFST is a search network from phonemes to characters or words;
  • the language WFST is a search network from characters or words to word sequences.
  • the speech recognition WFST is a search network from acoustic features to word sequences generated by integrating acoustic WFST, pronunciation WFST, dictionary WFST, and language WFST.
  • the acoustic WFST is a search network from acoustic features to pronunciation attributes;
  • the pronunciation WFST is a search network from pronunciation attributes to phonemes;
  • the dictionary WFST is a search network from phonemes to characters or words;
  • the language WFST is a search network from characters or words to word sequences.
  • the embodiment of the present application also provides a speech recognition decoding method, the method including: receiving a speech signal; extracting an acoustic feature sequence from the speech signal; sequentially inputting the acoustic feature sequence into an acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes; using the pronunciation attributes output on each path from acoustic features to pronunciation attributes as the input of the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes; using the context-dependent phonemes output on each path from pronunciation attributes to context-dependent phonemes as the input of the context WFST network to obtain the probability of each path from context-dependent phonemes to phonemes; using the phonemes output on each path from context-dependent phonemes to phonemes as the input of the dictionary WFST network to obtain the probability of each path from phonemes to characters or words; using the characters or words output on each path from phonemes to characters or words as the input of the language WFST network to obtain the probability of each path from characters or words to word sequences; and, according to the probability of each path in each WFST network, obtaining the total probability of each path from the acoustic feature sequence to a word sequence, and outputting the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
  • in the speech recognition decoding method, the total probability is calculated by a summation or a product operation.
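  • As a minimal sketch of these two options (the per-stage numbers are invented), the product of stage probabilities and the sum of their logarithms accumulate the same path score:

```python
import math

# Hypothetical per-stage path probabilities along one candidate path through
# the acoustic, pronunciation, context, dictionary, and language WFST networks.
stage_probs = [0.8, 0.7, 0.9, 0.6, 0.5]

# Product form: multiply the stage probabilities directly.
total_product = math.prod(stage_probs)

# Summation form: decoders usually keep log-probabilities, where the
# product becomes a sum.
total_log = sum(math.log(p) for p in stage_probs)

print(total_product, math.exp(total_log))  # both give the same total
```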
  • the embodiment of the present application also provides a speech recognition decoding method, the method including: receiving a speech signal; extracting an acoustic feature sequence from the speech signal; sequentially inputting the acoustic feature sequence into an acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes; using the acoustic features as the input of a second acoustic WFST network to obtain the probability of each path from acoustic features to context-dependent phonemes; using the pronunciation attributes output on each path from acoustic features to pronunciation attributes as the input of the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes; and using the context-dependent phonemes output on each path from pronunciation attributes to context-dependent phonemes, together with the context-dependent phonemes output by the second acoustic WFST network, as the input of the context WFST network to obtain the probability of each path from context-dependent phonemes to phonemes, continuing through the dictionary WFST and language WFST networks as above and outputting the word sequence with the largest total probability as the recognition result.
  • the speech recognition decoding method further includes: the total probability is calculated by a summation or a product operation.
  • a device for constructing a speech recognition WFST includes a processor, where the processor is configured to be coupled with a memory and to read and execute instructions in the memory.
  • the instructions include: constructing an acoustic weighted finite-state transducer WFST, where the acoustic WFST is a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST, where the pronunciation WFST is a search network from pronunciation attributes to phonemes; constructing a dictionary WFST, where the dictionary WFST is a search network from phonemes to characters or words; constructing a language WFST, where the language WFST is a search network from characters or words to word sequences; and integrating multiple WFSTs to generate the speech recognition WFST, where the multiple WFSTs include the acoustic WFST, the pronunciation WFST, the dictionary WFST, and the language WFST; the integrated speech recognition WFST is a search network from acoustic features to word sequences.
  • the instructions further include: constructing a second acoustic WFST, where the second acoustic WFST is a search network from acoustic features to phonemes; and integrating multiple WFSTs to generate the speech recognition WFST, where the multiple WFSTs further include the second acoustic WFST.
  • a device for constructing a speech recognition WFST includes a processor, where the processor is configured to be coupled with a memory and to read and execute instructions in the memory.
  • the instructions include: constructing an acoustic weighted finite-state transducer WFST, where the acoustic WFST is a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST, where the pronunciation WFST is a search network from pronunciation attributes to context-dependent phonemes; constructing a context WFST, where the context WFST is a search network from context-dependent phonemes to phonemes; constructing a dictionary WFST, where the dictionary WFST is a search network from phonemes to characters or words; constructing a language WFST, where the language WFST is a search network from characters or words to word sequences; and integrating multiple WFSTs to generate the speech recognition WFST, where the multiple WFSTs include the acoustic WFST, the pronunciation WFST, the context WFST, the dictionary WFST, and the language WFST; the integrated speech recognition WFST is a search network from acoustic features to word sequences.
  • the instructions further include: constructing a second acoustic WFST, where the second acoustic WFST is a search network from acoustic features to context-dependent phonemes; and integrating multiple WFSTs to generate the speech recognition WFST, where the multiple WFSTs further include the second acoustic WFST.
  • the WFST integration method in the speech recognition WFST construction device is the same as the embodiment related to the speech recognition WFST construction method.
  • Embodiments of the present application also provide a speech recognition decoding device. The device includes a processor, where the processor is configured to be coupled with a memory and to read and execute instructions in the memory, the instructions including: receiving a speech signal; extracting an acoustic feature sequence from the speech signal; sequentially inputting the acoustic feature sequence into an acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes; using the pronunciation attributes output on each path from acoustic features to pronunciation attributes as the input of the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to phonemes; using the phonemes output on each path from pronunciation attributes to phonemes as the input of the dictionary WFST network to obtain the probability of each path from phonemes to characters or words; and using the characters or words output on each path from phonemes to characters or words as the input of the language WFST network to obtain the probability of each path from characters or words to word sequences.
  • the paths obtained above refer to active paths, where an active path is a path that remains with a higher probability after paths with lower probability are pruned during the WFST search.
  • Embodiments of the present application also provide a speech recognition decoding device. The device includes a processor, where the processor is configured to be coupled with a memory and to read and execute instructions in the memory, the instructions including: receiving a speech signal; extracting an acoustic feature sequence from the speech signal; sequentially inputting the acoustic feature sequence into an acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes; using the acoustic feature sequence as the input of a second acoustic WFST network to obtain the probability of each path from acoustic features to phonemes; using the pronunciation attributes output on each path from acoustic features to pronunciation attributes as the input of the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to phonemes; and using the phonemes output on each path from pronunciation attributes to phonemes, together with the phonemes output by the second acoustic WFST network, as the input of the dictionary WFST network to obtain the probability of each path from phonemes to characters or words, continuing through the language WFST network and outputting the word sequence with the largest total probability as the recognition result.
  • the paths obtained above refer to active paths, where an active path is a path that remains with a higher probability after paths with lower probability are pruned during the WFST search.
  • the embodiment of the present application also provides a speech recognition decoding system, the system including a terminal and a server; the terminal is configured to receive a speech signal and send the speech signal to the server; the server is configured to receive the speech signal, extract an acoustic feature sequence from the speech signal, input the acoustic feature sequence into the speech recognition WFST, and obtain the probability of each path from the acoustic feature sequence to a word sequence; the probabilities of the paths are compared, and the word sequence corresponding to the path with the highest probability is output as the recognition result.
  • the speech recognition WFST is a search network from acoustic features to word sequences generated by integrating acoustic WFST, pronunciation WFST, context WFST, dictionary WFST, and language WFST.
  • the acoustic WFST is a search network from acoustic features to pronunciation attributes;
  • the pronunciation WFST is a search network from pronunciation attributes to context-dependent phonemes;
  • the context WFST is a search network from context-dependent phonemes to phonemes;
  • the dictionary WFST is a search network from phonemes to characters or words;
  • the language WFST is a search network from characters or words to word sequences.
  • the speech recognition WFST is a search network from acoustic features to word sequences generated by integrating acoustic WFST, pronunciation WFST, dictionary WFST, and language WFST.
  • the acoustic WFST is a search network from acoustic features to pronunciation attributes;
  • the pronunciation WFST is a search network from pronunciation attributes to phonemes;
  • the dictionary WFST is a search network from phonemes to characters or words;
  • the language WFST is a search network from characters or words to word sequences.
  • the WFST integration method in the speech recognition decoding system is the same as the embodiment related to the speech recognition WFST construction method.
  • the embodiment of the present application also provides a speech recognition decoding system, the system including a terminal and a server; the terminal is configured to receive a speech signal and send the speech signal to the server; the server is configured to receive the speech signal and extract an acoustic feature sequence from the speech signal;
  • the acoustic feature sequence is sequentially input into the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes; the pronunciation attributes output on each path from acoustic features to pronunciation attributes are used as the input of the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes; the context-dependent phonemes output on each path from pronunciation attributes to context-dependent phonemes are used as the input of the context WFST network to obtain the probability of each path from context-dependent phonemes to phonemes; the phonemes output on each path from context-dependent phonemes to phonemes are used as the input of the dictionary WFST network to obtain the probability of each path from phonemes to characters or words; the characters or words output on each path from phonemes to characters or words are used as the input of the language WFST network to obtain the probability of each path from characters or words to word sequences;
  • the total probability of each path from the acoustic feature sequence to a word sequence is obtained according to the probability of each path in each WFST network, and the word sequence corresponding to the path with the largest total probability is output as the recognition result corresponding to the acoustic feature sequence.
  • the paths obtained in the foregoing steps refer to active paths, where an active path is a path that remains with a higher probability after paths with lower probability are pruned during the WFST search.
  • the embodiment of the present application also provides a speech recognition decoding system, the system including a terminal and a server; the terminal is configured to receive a speech signal and send the speech signal to the server; the server is used to receive the speech signal and extract an acoustic feature sequence from the speech signal.
  • the acoustic feature sequence is sequentially used as the input of the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes; the acoustic features are used as the input of the second acoustic WFST network to obtain the probability of each path from acoustic features to context-dependent phonemes; the pronunciation attributes output on each path from acoustic features to pronunciation attributes are used as the input of the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes; the context-dependent phonemes output on each path from pronunciation attributes to context-dependent phonemes, together with the context-dependent phonemes output by the second acoustic WFST network, are used as the input of the context WFST network to obtain the probability of each path from context-dependent phonemes to phonemes; the phonemes are used as the input of the dictionary WFST network to obtain the probability of each path from phonemes to characters or words; the characters or words output on each path from phonemes to characters or words are used as the input of the language WFST network to obtain the probability of each path from characters or words to word sequences; the total probability of each path from the acoustic feature sequence to a word sequence is obtained according to the probability of each path in each WFST network, and the word sequence corresponding to the path with the largest total probability is output as the recognition result corresponding to the acoustic feature sequence.
  • the paths obtained in the foregoing steps refer to active paths, where an active path is a path that remains with a higher probability after paths with lower probability are pruned during the WFST search.
  • the embodiment of the present application also provides a method for constructing an acoustic WFST.
  • the method includes: using a hidden Markov model (HMM) with pronunciation attributes as the states and acoustic features as the observations to obtain the probability of generating a given acoustic feature conditioned on a pronunciation attribute, and constructing the acoustic WFST based on that probability.
  • Specifically, with pronunciation attributes as the states and acoustic features as the observations, the HMM is combined with the forward-backward algorithm, the expectation-maximization algorithm, and the Viterbi algorithm to obtain the probability that a given acoustic feature is generated under the condition of a pronunciation attribute, and the acoustic WFST is constructed based on that probability.
  • the embodiment of the present application also provides a method for constructing a pronunciation WFST.
  • the method includes: performing multi-target neural network training with acoustic features as input and with pronunciation attributes and phonemes (or context-dependent phonemes) as the dual output targets, and finally obtaining the co-occurrence probabilities of pronunciation attributes and phonemes (or context-dependent phonemes) to build the pronunciation WFST.
  • An embodiment of the present application further provides a speech recognition and decoding device.
  • the device includes: a speech signal receiving unit for receiving a speech signal; and an acoustic feature extraction unit for extracting an acoustic feature sequence from the speech signal received by the speech signal receiving unit.
  • the first acquisition unit is used to sequentially input the acoustic feature sequence extracted by the acoustic feature extraction unit into the acoustic WFST network to obtain the probability of each path from the acoustic feature to the pronunciation attribute;
  • a second acquisition unit for inputting the pronunciation attributes of each path obtained by the first acquisition unit into the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes;
  • a third acquisition unit for inputting the context-dependent phonemes of each path obtained by the second acquisition unit into the context WFST network to obtain the probability of each path from context-dependent phonemes to phonemes;
  • a fourth acquisition unit for inputting the phonemes of each path obtained by the third acquisition unit into the dictionary WFST network to obtain the probability of each path from phonemes to characters or words; a fifth acquisition unit for inputting the characters or words of each path obtained by the fourth acquisition unit into the language WFST network to obtain the probability of each path from characters or words to word sequences; and a result output unit for obtaining, according to the probabilities of the paths obtained by each acquisition unit, the total probability of each path from the acoustic feature sequence to a word sequence, and outputting the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
  • the paths obtained in the foregoing steps refer to active paths, where an active path is a path that remains with a higher probability after paths with lower probability are pruned during the WFST search.
  • An embodiment of the present application further provides a speech recognition and decoding device.
  • the device includes: a speech signal receiving unit for receiving a speech signal; and an acoustic feature extraction unit for extracting an acoustic feature sequence from the speech signal received by the speech signal receiving unit;
  • a first acquisition unit for sequentially inputting the acoustic feature sequence extracted by the acoustic feature extraction unit into the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes;
  • a second acquisition unit for using the acoustic feature sequence as the input of the second acoustic WFST network to obtain the probability of each path from the acoustic feature sequence to context-dependent phonemes;
  • a third acquisition unit for inputting the pronunciation attributes output on each path obtained by the first acquisition unit into the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes;
  • a fourth acquisition unit for inputting the context-dependent phonemes output on each path, together with the context-dependent phonemes output by the second acoustic WFST network, into the context WFST network to obtain the probability of each path from context-dependent phonemes to phonemes, with further units handling the dictionary WFST, the language WFST, and the result output as in the device described above.
  • the paths obtained in the foregoing steps refer to active paths, where an active path is a path that remains with a higher probability after paths with lower probability are pruned during the WFST search.
  • the above embodiments of the present application add a WFST from acoustic features to pronunciation attributes and a WFST from pronunciation attributes to phonemes to obtain a new speech recognition WFST, thereby introducing pronunciation attribute features, which are not affected by external interference such as noise and reverberation, into the speech recognition decoding process; this improves the robustness of the speech recognition system to the environment and improves the accuracy of speech recognition.
  • FIG. 1a shows an example of the WFST in the embodiment of the present invention
  • FIG. 1b shows an example of the WFST in the embodiment of the present invention
  • FIG. 1c shows an example of the result of integrating the WFST in FIGS. 1a and 1b;
  • FIG. 2 is a diagram of a speech recognition decoding system according to an embodiment of the present invention.
  • FIG. 3 is a structural diagram of a speech recognition decoding system according to an embodiment of the present invention.
  • FIG. 4 is a structural diagram of another speech recognition decoding system according to an embodiment of the present invention.
  • FIG. 5 shows a flowchart of constructing a speech recognition WFST according to an embodiment of the present invention
  • FIG. 6 shows another WFST construction flowchart for speech recognition according to an embodiment of the present invention
  • FIG. 7 shows a speech recognition decoding process according to an embodiment of the present invention
  • FIG. 8 shows another speech recognition decoding process according to an embodiment of the present invention.
  • FIG. 9 shows another speech recognition decoding process according to an embodiment of the present invention.
  • FIG. 10 shows another speech recognition decoding process according to an embodiment of the present invention.
  • FIG. 11 shows still another speech recognition decoding process according to an embodiment of the present invention.
  • FIG. 12 is a structural diagram of a server according to an embodiment of the present invention.
  • FIG. 13 is a structural diagram of an electronic terminal according to an embodiment of the present invention.
  • FIG. 14 shows a structural diagram of a speech recognition decoding device according to an embodiment of the present invention.
  • FIG. 15 is a structural diagram of another speech recognition and decoding device according to an embodiment of the present invention.
  • FIG. 16 shows a structural diagram of another speech recognition decoding device according to an embodiment of the present invention.
  • FIG. 17 is a structural diagram of another speech recognition decoding device according to an embodiment of the present invention.
  • the speech recognition decoder in the embodiment of the present invention is constructed based on the speech recognition WFST.
  • a WFST is a weighted finite-state transducer used for large-scale speech recognition. Each state transition is labeled with an input symbol and an output symbol, so the constructed network (WFST) encodes a mapping from an input symbol sequence, or string, to an output string. In addition to input and output symbols, a WFST assigns a weight to each state transition.
  • the weight can encode a probability, a duration, or any other quantity that is accumulated along a path to compute the overall weight of mapping the input string to the output string.
  • a WFST used for speech recognition typically represents the various possible path choices and their corresponding probabilities, so that a recognition result can be output once a speech signal is input.
  • Composition of WFSTs combines two WFSTs at different levels.
  • For example, the dictionary WFST is a mapping from phonemes to characters or words,
  • and the language WFST is a mapping from characters or words to word sequences;
  • composing the two WFSTs yields a mapping from phonemes to word sequences.
  • Figures 1a, 1b, and 1c show an example of WFST composition.
  • Figures 1a and 1b are two WFSTs at different levels, and
  • Figure 1c is the new WFST generated by composing them.
  • In the first step of the model in Fig. 1a there are two paths: the first path is 0->1 with input A1, output B1, and probability 0.2 (written A1:B1/0.2), and the second path is 0->2, A2:B2/0.3. The first step of the model in Fig. 1b has only one path, 0->1, B1:C2/0.4. Therefore, after composing Figs. 1a and 1b there is only one path at this step, A1->B1->C2, that is, the corresponding path shown in Fig. 1c.
  • From state 1 of the network in Fig. 1a there are likewise two outgoing paths, namely 1->1, A3:B2/0.4 and 1->3, A1:B2/0.5; from state 1 of the network in Fig. 1b there is only one path, namely 1->2, B2:C4/0.5.
  • A3:B2 can be combined with B2:C4, and A1:B2 can also be combined with B2:C4.
  • the two new transitions thus reached are (1,1)->(1,2), A3:C4/0.6 and (1,1)->(3,2), A1:C4/1.
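  • A minimal sketch of this pairing of arcs is given below. It assumes arcs are stored as (src, dst, in_label, out_label, weight) tuples and that matched weights combine by addition (as in the A1:C4/1 example above); it ignores epsilon transitions and state pruning, which real composition must handle.

```python
# Toy WFST composition: an arc p:q/w1 in A matches an arc q:r/w2 in B,
# yielding p:r with a combined weight between paired states.

def compose(arcs_a, arcs_b):
    result = []
    for (sa, da, ia, oa, wa) in arcs_a:
        for (sb, db, ib, ob, wb) in arcs_b:
            if oa == ib:  # output of A feeds the input of B
                result.append(((sa, sb), (da, db), ia, ob, wa + wb))
    return result

fig_1a = [(1, 1, "A3", "B2", 0.4), (1, 3, "A1", "B2", 0.5)]
fig_1b = [(1, 2, "B2", "C4", 0.5)]
for arc in compose(fig_1a, fig_1b):
    print(arc)  # e.g. ((1, 1), (3, 2), 'A1', 'C4', 1.0)
```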
  • FIG. 2 is a diagram of a speech recognition decoding system according to an embodiment of the present invention.
  • the speech recognition method and device of the embodiments of the present invention are applied to an electronic terminal and one or more servers (101, 102) as shown in FIG. 2.
  • the terminal may include, but is not limited to, a smart phone, a personal computer, a tablet computer, a smart watch, smart glasses, a smart audio device, a vehicle-mounted electronic terminal, a service robot, and the like.
  • the electronic terminal, the server 101, and the server 102 may be communicatively connected through one or more networks, and the network may be wired or wireless, such as the Internet, a cellular network, a satellite network, a local area network, and the like.
  • the server 102 is configured to construct a speech recognition WFST, and outputs the constructed speech recognition WFST to the server 101 for construction of a speech recognition decoder and speech recognition decoding.
  • the specific construction includes: constructing an acoustic weighted finite-state transducer WFST, which is a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST, which is a search network from pronunciation attributes to phonemes; constructing a dictionary WFST, where the dictionary WFST is a search network from phonemes to characters or words; constructing a language WFST, where the language WFST is a search network from characters or words to word sequences; and constructing a second acoustic WFST, where the second acoustic WFST is a search network from acoustic features to phonemes and constructing the second acoustic WFST is an optional step;
  • the above construction of the speech recognition WFST may also optionally include a context WFST, where the context WFST is a search network from context-dependent phonemes to phonemes; when the speech recognition WFST includes a context WFST, the pronunciation WFST is a search network from pronunciation attributes to context-dependent phonemes.
  • Multiple WFSTs are integrated to generate the speech recognition WFST, which is a search network from acoustic features to word sequences; the multiple WFSTs include: the acoustic WFST, the pronunciation WFST, the dictionary WFST, the language WFST, the second acoustic WFST (optional), and the context WFST (optional).
  • the electronic terminal picks up speech sound waves through a voice acquisition device, such as a microphone, generates a speech signal, and sends the speech signal to the server 101; the server 101 is configured to receive the speech signal, extract an acoustic feature sequence from it, input the extracted acoustic feature sequence into the speech recognition WFST for searching, obtain the probability of each path from acoustic features to word sequences, compare the probabilities of the respective paths, and send the word sequence of the path with the highest probability back to the terminal as the recognition result.
  • the path obtained by the WFST search during the speech recognition process may refer to an active path.
  • an active path is defined as follows: every path in the WFST has a probability value; to reduce the amount of computation during decoding, paths with lower probability are pruned during decoding and are no longer expanded, while paths with higher probability continue to be expanded. These continuing paths are the active paths.
  • the construction of the speech recognition WFST in the specific embodiments of the present invention may also be performed by the electronic terminal device; that is, the electronic terminal device executes the above construction method of the embodiments of the present invention to construct the speech recognition WFST, and performs speech decoding based on the speech recognition WFST;
  • the construction of the speech recognition WFST may also be performed by the server 101; that is, the functions of the server 101 and the server 102 are combined, and the server 101 executes the construction method of the embodiments of the present invention to construct the speech recognition WFST and, based on it, performs speech decoding on the speech signal sent by the terminal.
  • the speech recognition decoding method and device of the embodiment of the present invention are applied to the electronic terminal and the server as shown in FIG. 2.
  • the electronic terminal may also receive the speech signal and send the signal to the server 101; the server 101 may be used to receive the speech signal and, based on the acoustic WFST, the pronunciation WFST, the dictionary WFST, the language WFST, the second acoustic WFST (optional), and the context WFST (optional), perform the speech recognition methods in Figures 8-11.
  • Alternatively, the electronic terminal may receive the speech signal and itself execute, based on the acoustic WFST, the pronunciation WFST, the dictionary WFST, the language WFST, the second acoustic WFST (optional), and the context WFST (optional), the speech recognition methods shown in Figures 8-11.
  • the speech recognition method in FIGS. 8-11 will be described in detail in the description of specific embodiments later.
  • the speech recognition decoding scheme in the embodiment of the present invention may also be a dynamic decoding scheme, in which the individual WFSTs are not integrated into a single speech recognition WFST and the terminal or server decodes directly on each WFST.
  • The construction process of a speech recognition decoding WFST according to an embodiment of the present invention is shown in FIG. 5 and mainly includes:
  • Step 501 Generate an acoustic WFST.
  • the acoustic WFST is a search network from acoustic features to pronunciation attributes, and may be, for example, a hidden Markov model (HMM) WFST (represented by H1).
  • An HMM is a probabilistic model for time series. It describes the process in which a hidden Markov chain generates an unobservable random sequence of states, and each state then generates an observation, yielding a random observation sequence.
  • the parameters of the HMM include the set of all possible states and the set of all possible observations.
  • HMM is determined by initial probability distribution, state transition probability distribution, and observation probability distribution. The initial probability distribution and the state transition probability distribution determine the state sequence, and the observation probability distribution determines the observation sequence.
  • Given the model parameters and an observation sequence, the probability of observing that sequence under the model is computed with the forward-backward algorithm; given an observation sequence, the model parameters are estimated with the expectation-maximization algorithm so that the probability of the observation sequence under the model is maximized; and given the model and an observation sequence, the optimal state sequence is estimated with the Viterbi algorithm.
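  • As a minimal sketch of the first of these three problems (all parameter values are invented; the states stand in for pronunciation attributes and the discrete observation indices for acoustic features), the forward algorithm computes the probability of an observation sequence:

```python
import numpy as np

pi = np.array([0.6, 0.4])                # initial state distribution
A  = np.array([[0.7, 0.3], [0.4, 0.6]])  # state transition probabilities
B  = np.array([[0.5, 0.5], [0.1, 0.9]])  # observation probabilities per state

def forward_probability(obs):
    """Probability of the observation sequence `obs` under the HMM (pi, A, B)."""
    alpha = pi * B[:, obs[0]]            # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # one transition step, then absorb o
    return alpha.sum()                   # sum over the possible final states

print(forward_probability([0, 1, 1]))
```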
  • the construction of the acoustic WFST can be based on the pronunciation attribute as the state and the acoustic feature as the observation.
  • the acoustic input can be expressed as a sequence of various acoustic features.
  • the HMM is used to describe the process by which pronunciation attributes generate acoustic features: the forward-backward algorithm computes the probability of observing the acoustic features with pronunciation attributes as the states; given the acoustic features, the expectation-maximization algorithm estimates the HMM parameters so that the probability of observing the acoustic features, with pronunciation attributes as the states, is maximized under those parameters;
  • and with the estimated model parameters, the Viterbi algorithm estimates the pronunciation attribute sequence and the probability of generating a given observation (acoustic feature) conditioned on a pronunciation attribute.
  • Step 502 Generate a pronunciation WFST.
  • the pronunciation WFST is a search network (represented by A) from pronunciation attributes (Articulatory Features) to phonemes or context-dependent phonemes.
  • the pronunciation attribute information may be a classification by known manners of articulation, as shown in Table 1, or a classification by place of articulation, but is not limited thereto; new pronunciation attribute categories may also be learned by a neural network.
  • Neural networks include a variety of algorithm models, including deep neural network algorithms.
  • the pronunciation WFST can be a search network from pronunciation attributes to phonemes, or a search network from pronunciation attributes to context-dependent phonemes. When constructing a search network from pronunciation attributes to phonemes, a deep neural network can take acoustic features as input and use pronunciation attributes and phonemes as dual-target outputs.
  • the subsequent steps are then also based on training with pronunciation attributes and phonemes; when constructing a search network from pronunciation attributes to context-dependent phonemes, a deep neural network can take acoustic features as input and use pronunciation attributes and context-dependent phonemes as dual-target outputs.
  • the subsequent steps are also based on the training of pronunciation attributes and context-dependent phonemes.
  • the training method is the same for both pronunciation WFST construction processes; only the training and construction targets differ. Therefore, in the description below the two training processes are described together,
  • with the phoneme or context-dependent phoneme target selected according to the goal:
  • the phoneme-based training scheme is selected when generating the search network from pronunciation attributes to phonemes, and the context-dependent-phoneme-based training scheme is selected when generating the search network from pronunciation attributes to context-dependent phonemes.
  • Table 1 Examples of correspondence between English phonemes and pronunciation attributes
  • The deep neural network takes acoustic features as input and uses pronunciation attributes and phonemes/context-dependent phonemes as dual-target outputs. Forward propagation computes the phoneme/context-dependent phoneme probabilities and the pronunciation attribute probabilities for a given acoustic feature, and gradient descent trains the network parameters.
  • the above-mentioned pronunciation attribute probability refers to the probability that the acoustic feature belongs to each pronunciation attribute, and the pronunciation attribute with the highest probability is defined as the pronunciation attribute of the current acoustic feature.
  • for each frame of acoustic features, the above classifier yields the pronunciation attribute with the highest probability, giving a pronunciation attribute sequence A1, A2, ..., AT of length T.
  • This sequence of pronunciation attributes is the new label.
  • Any speech frame will be classified into a certain pronunciation attribute and a certain phoneme / context-dependent phoneme.
  • the above deep neural network is used to obtain the most probable phoneme and the most probable pronunciation attribute for each frame.
  • For example, suppose the most probable phoneme/context-dependent phoneme is P / P1-P+P2, and the most probable pronunciation attribute is A.
  • the pronunciation attribute A and the phoneme P / context-dependent phonemes P1-P + P2 co-occur.
  • the co-occurrence times of A and P / P1-P + P2 are counted on a large number of speech databases, and divided by the total number of frames to obtain the co-occurrence probability of A and P / P1-P + P2.
  • the co-occurrence probability of any phoneme / context-dependent phoneme and any pronunciation attribute is obtained by the same method.
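  • A minimal sketch of this counting follows (the frame labels are invented; in practice the attribute and phoneme for each frame come from the dual-target network above):

```python
from collections import Counter

# Each frame contributes one (most-probable attribute, most-probable phoneme)
# pair; counting pairs over the corpus and dividing by the total number of
# frames yields the co-occurrence probabilities used as pronunciation WFST
# transition weights (input: attribute, output: phoneme).
frames = [("nasal", "m"), ("nasal", "m"), ("fricative", "s"),
          ("nasal", "n"), ("fricative", "s"), ("vowel", "a")]

counts = Counter(frames)
total_frames = len(frames)
cooccurrence_prob = {pair: c / total_frames for pair, c in counts.items()}

print(cooccurrence_prob[("nasal", "m")])  # 2 of 6 frames -> 0.333...
```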
  • Step 503: Generate a context WFST, where the context WFST is a mapping from context-dependent phonemes to phonemes.
  • the most commonly used context-dependent phoneme models are triphones (written as phoneme/left-phoneme/right-phoneme); models with four phonemes can also be used.
  • Contextual WFST builds a mapping from context-dependent phonemes to phonemes.
  • the context WFST starts from a certain state, receives a context-dependent phoneme, outputs a phoneme and probability, reaches the target state, and completes a transition.
  • the context WFST can be generated by other methods in the prior art.
  • Step 503 is an optional step. When step 503 is present, step 502 generates a search network from pronunciation attributes to context-dependent phonemes; when step 503 is absent, step 502 generates a search network from pronunciation attributes to phonemes.
  • Step 504: Generate a dictionary WFST.
  • the dictionary WFST is a search network (represented by L) from phonemes to characters or words.
  • Dictionaries are usually expressed in the form of word-phoneme sequences. If a word has different pronunciations, it will be represented as multiple word-phoneme sequences.
  • Disambiguation symbols are the symbols #1, #2, #3, etc. inserted at the end of phoneme sequences in the dictionary.
  • When a phoneme sequence is a prefix of another phoneme sequence in the dictionary, or appears in more than one word, one of these symbols needs to be appended to it to ensure that the WFST is determinizable.
  • the dictionary generated by the above process represents the word-phoneme mapping relationship in the form of a WFST, which receives phoneme sequences and outputs words.
  • the dictionary WFST can also be constructed and generated by other methods in the prior art.
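  • The insertion of disambiguation symbols can be sketched as follows (a toy lexicon and invented naming, not the patent's exact procedure; real toolchains differ in detail):

```python
from collections import Counter

# Append a distinct #k symbol to any pronunciation that is shared by several
# words or that is a prefix of another pronunciation, so that the dictionary
# WFST can be determinized. The lexicon below is invented for illustration.
lexicon = {"red":     ["r", "eh", "d"],
           "read":    ["r", "eh", "d"],               # homophone of "red"
           "reddish": ["r", "eh", "d", "ih", "sh"]}

def add_disambiguation(lexicon):
    seqs = [tuple(p) for p in lexicon.values()]
    dup = Counter(seqs)        # how many words share each pronunciation
    next_id = Counter()        # next #k index per pronunciation
    out = {}
    for word, phones in lexicon.items():
        key = tuple(phones)
        is_prefix = any(s != key and s[:len(key)] == key for s in seqs)
        if dup[key] > 1 or is_prefix:
            next_id[key] += 1
            out[word] = phones + [f"#{next_id[key]}"]
        else:
            out[word] = list(phones)
    return out

print(add_disambiguation(lexicon))
# "red" ends in '#1', "read" in '#2'; "reddish" is left unchanged
```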
  • Step 505: Generate a language WFST.
  • the language WFST is a search network (represented by G) from characters or words to word sequences.
  • Language models describe the probability distributions of grammatical units such as characters, words, and word sequences. They are used to compute the probability that a word sequence appears, or to predict the probability of a word given the preceding word history.
  • An N-gram language model uses a Markov assumption: the probability of a word appearing depends only on the N-1 words that precede it.
  • the 1-gram language model indicates that the appearance of a word is only related to itself
  • the 2-gram indicates that the appearance of a word is only related to the previous word
  • the 3-gram indicates that the appearance of a word is only related to the first two words, and so on.
  • Maximum likelihood estimation is used for the probability estimates: the probability is computed from the number of times each N-gram word sequence appears in the corpus. The word sequences and their probabilities are then represented as state transitions.
  • the language WFST can also be constructed and generated in other ways in the prior art.
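  • A minimal sketch of the maximum-likelihood estimate for N = 2 (the tiny corpus is invented; each estimated probability would label one state transition of G):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)                    # count(w1)
bigrams = Counter(zip(corpus, corpus[1:]))    # count(w1 w2)

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # "the" occurs 3 times, "the cat" twice -> 2/3
```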
  • step 503 is an optional step.
  • when step 503 is included, the pronunciation WFST is a search network from pronunciation attributes to context-dependent phonemes,
  • and step 510 is: integrating the acoustic WFST, the pronunciation WFST, the context WFST, the dictionary WFST, and the language WFST, and determinizing and minimizing the integrated result to generate the speech recognition WFST.
  • The integration process is: the dictionary WFST is integrated with the language WFST; the resulting WFST is then integrated with the context WFST; the result is then integrated with the pronunciation WFST and finally integrated with the acoustic WFST.
  • a weighted finite-state transducer corresponding to the mapping from acoustic features (state probability distributions) to word sequences is thereby obtained.
  • when step 503 is not included, step 510 is: integrating the acoustic WFST, the pronunciation WFST, the dictionary WFST, and the language WFST, and determinizing and minimizing the integrated WFST to generate the decoder.
  • the integration process is: the dictionary WFST is integrated with the language WFST, and the resulting WFST is integrated with the pronunciation WFST and further integrated with the acoustic WFST.
  • a weighted finite-state transducer corresponding to the mapping from acoustic features (state probability distributions) to word sequences is thereby obtained.
  • FIG. 6 is another speech recognition decoding WFST construction method corresponding to the embodiment of the present invention. The method differs from the speech recognition decoding WFST construction method of the embodiment shown in FIG. 5 by adding the following steps:
  • Step 606: Generate a second acoustic WFST network, where the second acoustic WFST (represented by H2) is a search network from acoustic features to phonemes or context-dependent phonemes.
  • step 603 in the speech recognition decoding WFST construction method shown in FIG. 6 is an optional step.
  • the second acoustic WFST may be a search network from acoustic features to phonemes, or a search network from acoustic features to context-dependent phonemes.
  • when step 603 is not included in the construction process, the second acoustic WFST generated in step 606 is a search network from acoustic features to phonemes; when step 603 is included, the second acoustic WFST generated in step 606 is a search network from acoustic features to context-dependent phonemes.
  • Steps 601, 602, 603, 604, and 605 are the same as steps 501, 502, 503, 504, and 505 in the embodiment shown in FIG. 5, and details are not described herein again. Steps 601, 602, 603, 604, 605, 606 are in no particular order.
  • Step 610: Integrate the acoustic WFST, the pronunciation WFST, the second acoustic WFST, the context WFST (optional), the dictionary WFST, and the language WFST, then determinize and minimize the integrated WFST to generate the decoder.
  • when the integration process includes the context WFST, the result is expressed as (H1 * A + H2) * C * L * G.
  • the integration process is as follows: the integration result of the acoustic WFST and the pronunciation WFST is network-merged with the second acoustic WFST to generate a WFST from acoustic features to context-dependent phonemes; the dictionary WFST is integrated with the language WFST, and the resulting finite-state transducer is then integrated with the context WFST;
  • the integrated result is then integrated with the network-merged WFST, and the integrated speech recognition decoder WFST is expressed as (H1 * A + H2) * C * L * G.
  • the integration process does not include the context WFST, it is expressed as (H1 * A + H2) * L * G.
  • the integration process is to combine the integration result of the acoustic WFST and the pronunciation WFST with the second acoustic WFST to generate an acoustic feature to the phoneme.
  • WFST, the dictionary WFST is integrated with the language model WFST, and the resulting finite state converter is integrated with the network merged WFST.
  • the integrated speech recognition decoder WFST is represented by (H1 * A + H2) * L * G, Each of these successful paths represents a possible correspondence of acoustic features to word sequences. Composition of each of the above WFSTs finally forms a mapping of acoustic features to word sequences.
  • FIG. 7 illustrates a speech recognition decoding method according to an embodiment of the present invention.
  • Step 701 Extract acoustic feature information from a voice signal frame.
  • Acoustic feature extraction may include, for example, dividing the voice signal output by the signal pickup unit into multiple voice signal frames, enhancing each voice signal frame through processing such as eliminating noise and channel distortion, converting each voice signal frame from the time domain to the frequency domain, and extracting appropriate acoustic features from the converted speech signal frames.
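  • To make the framing and time-to-frequency conversion concrete, the sketch below (a hypothetical, simplified pipeline using numpy; the 16 kHz rate, window length, shift, and log-power features are illustrative choices, not values given in this document) cuts a signal into frames, applies a window, and converts each frame to the frequency domain:
```python
import numpy as np

def extract_features(signal, frame_len=400, frame_shift=160):
    """Split a 1-D waveform into frames and return log power spectra."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_shift : i * frame_shift + frame_len]
        spectrum = np.abs(np.fft.rfft(frame * window)) ** 2
        feats.append(np.log(spectrum + 1e-10))  # floor to avoid log(0)
    return np.array(feats)  # shape: (n_frames, frame_len // 2 + 1)

features = extract_features(np.random.randn(16000))  # 1 second at 16 kHz
print(features.shape)  # (98, 201)
```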
  • Step 702: With the acoustic features as input, search the paths of the speech recognition WFST network and obtain the probability of each path from acoustic features to word sequences.
  • the voice recognition WFST may be generated by the method mentioned in FIG. 5 and FIG. 6.
  • Step 703 Compare the probabilities of the paths, and the word sequence corresponding to the path with the highest probability is output as the recognition result.
  • The paths obtained by the WFST searches performed in the subsequent decoding steps may refer to active paths.
  • Each path in the WFST has a probability value. To reduce the amount of computation during decoding, paths with lower probability are pruned during the decoding process and are no longer expanded, while paths with higher probability continue to be expanded; these remaining paths are the active paths.
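  • A minimal sketch of this pruning (hypothetical code working in log probabilities; the beam width is an illustrative parameter, not a value given in this document):
```python
def prune_to_active(paths, beam=10.0):
    """Keep only the active paths: those within `beam` of the best score."""
    best = max(score for _, score in paths)
    return [(path, score) for path, score in paths if score >= best - beam]

paths = [(["a"], -5.0), (["b"], -7.5), (["c"], -40.0)]
print(prune_to_active(paths))  # the -40.0 path is clipped; two stay active
```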
  • The speech recognition decoding method of the above embodiment performs decoding with a speech recognition WFST generated by taking into account the correlations between pronunciation attributes and phonemes and between pronunciation attributes and acoustic features. This can enhance resistance to interference from external noise and reverberation and improve the robustness of the speech recognition system to the environment.
  • FIG. 8 illustrates another speech recognition decoding method according to an embodiment of the present invention:
  • Step 801 Extract an acoustic feature sequence from a speech signal.
  • The acoustic feature extraction method may be, for example, dividing the voice signal output by the signal pickup unit into multiple voice signal frames, enhancing each voice signal frame by processing such as eliminating noise and channel distortion, converting each voice signal frame from the time domain to the frequency domain, and extracting appropriate acoustic features from the converted speech signal frames.
  • Step 802 Input the acoustic feature corresponding to the speech frame into the acoustic WFST, and obtain the probability of each path from the acoustic feature to the pronunciation attribute.
  • Step 803 Use the pronunciation attributes output from the paths from the acoustic features to the pronunciation attributes as the input of the pronunciation WFST network to obtain the probability of each path from the pronunciation attributes to the phonemes.
  • Step 804: Use the phonemes output by the paths from pronunciation attributes to phonemes as the input of the dictionary WFST network to obtain the probability of each path from phonemes to characters (or words).
  • Step 805: Use the characters (or words) output by the paths from phonemes to characters (or words) as the input of the language WFST network to obtain the probability of each path from characters (or words) to word sequences.
  • Step 810 Obtain the total probability of each path from the acoustic feature sequence to the word sequence according to the probability of each path in each WFST, and output the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
  • The acoustic feature sequence generally refers to the acoustic features corresponding one-to-one to the received voice signal frames, from the start frame to the last frame of the voice frame sequence.
  • There are many ways to combine the path probabilities across the WFSTs into a total path probability, including but not limited to summation, product, and other linear and non-linear transformations.
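  • For example, taking the product of the per-WFST path probabilities amounts to a summation in the log domain (one of the combinations mentioned above; the snippet is an illustrative sketch):
```python
import math

def total_log_prob(component_probs):
    """Product of per-WFST path probabilities, computed as a log-domain sum."""
    return sum(math.log(p) for p in component_probs)

# e.g. acoustic, pronunciation, dictionary, and language scores for one path
print(total_log_prob([0.8, 0.6, 0.9, 0.7]))  # equals log(0.8*0.6*0.9*0.7)
```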
  • the above-mentioned acoustic WFST, pronunciation WFST, dictionary WFST, and language WFST may be generated based on the corresponding construction steps in FIG. 5.
  • the above decoding step in FIG. 8 may use a time-synchronous Viterbi Beam search algorithm as an example:
  • The Viterbi-Beam search algorithm is a breadth-first, frame-synchronous algorithm whose core is a nested loop: each time the search advances by one frame, the Viterbi update is run separately for each node of the corresponding level.
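  • The nested loop can be sketched as follows (hypothetical, simplified code: `arcs[state]` lists `(next_state, out_symbol, log_prob)` transitions and `emit` scores a frame against a state; both are stand-ins for the search network, not interfaces defined in this document):
```python
def viterbi_beam(frames, arcs, emit, start=0, beam=8.0):
    """Time-synchronous Viterbi search with beam pruning."""
    hyps = {start: (0.0, [])}                      # state -> (log prob, output)
    for frame in frames:                           # outer loop: advance one frame
        new_hyps = {}
        for state, (score, out) in hyps.items():   # inner loop: expand each node
            for nxt, sym, lp in arcs.get(state, []):
                s = score + lp + emit(frame, nxt)
                if nxt not in new_hyps or s > new_hyps[nxt][0]:
                    new_hyps[nxt] = (s, out + [sym])
        best = max(s for s, _ in new_hyps.values())
        hyps = {st: h for st, h in new_hyps.items() if h[0] >= best - beam}
    return max(hyps.values(), key=lambda h: h[0])  # best surviving (score, output)
```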
  • The speech recognition decoding method and device of the above embodiments perform decoding by using the acoustic WFST and the pronunciation WFST, which are constructed by taking into account the correlations between pronunciation attributes and acoustic features and between pronunciation attributes and phonemes; this can enhance resistance to interference from external noise and reverberation and improve the robustness of the speech recognition system to the environment.
  • FIG. 9 is a flowchart of another speech recognition decoding method according to an embodiment of the present invention.
  • Step 901 Extract acoustic feature information from a voice signal frame.
  • The received voice signal to be recognized can be cut into multiple voice signal frames.
  • The decoding and recognition process first performs acoustic feature extraction on the voice signal.
  • Step 902 Input the acoustic feature corresponding to the speech frame into the acoustic WFST, and obtain the probability score of each path from the acoustic feature to the pronunciation attribute.
  • Step 903: Use the pronunciation attributes output by the paths from acoustic features to pronunciation attributes as the input of the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes.
  • Step 904 Use the context-dependent phonemes output from each path of the pronunciation attribute to the context-dependent phoneme as the input of the context WFST network to obtain the probability of each path of the context-dependent phoneme to the phoneme.
  • Step 905: Use the phonemes output by the paths from context-dependent phonemes to phonemes as the input of the dictionary WFST network to obtain the probability of each path from phonemes to characters (or words).
  • Step 906: Use the characters (or words) output by the paths from phonemes to characters (or words) as the input of the language WFST network in the decoder to obtain the probability of each path from characters (or words) to word sequences.
  • Step 910: Obtain, according to the probability of each path in each WFST, the total probability of each path from the acoustic feature sequence (from the start frame to the last frame) to a word sequence, and output the word sequence corresponding to the path with the largest total probability as the recognition result of the acoustic feature sequence.
  • the above-mentioned acoustic WFST, pronunciation WFST, context WFST, dictionary WFST, and language WFST may be generated based on the corresponding construction steps in FIG. 5.
  • By using for decoding the acoustic WFST and the pronunciation WFST, which are constructed by taking into account the correlations between pronunciation attributes and acoustic features and between pronunciation attributes and phonemes, pronunciation knowledge is added to the speech recognition and decoding process. In strong-noise and strong-reverberation environments such as the far field, the pronunciation attributes are not disturbed by noise, which compensates for the inaccuracy of traditional acoustic models under external interference such as noise and reverberation. In addition, the introduction of the context phoneme model can improve the accuracy of phoneme recognition during speech recognition.
  • FIG. 10 shows a flowchart of another speech recognition decoding method according to an embodiment of the present invention. Compared with the decoding flowchart of FIG. 8, the difference lies in the following steps:
  • Step 1002: Acoustic features are used as the input of the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes; the acoustic features are also used as the input of the second acoustic WFST network to obtain the probability of each path from acoustic features to phonemes;
  • Step 1004: Use the phonemes output by the paths from pronunciation attributes to phonemes, together with the phonemes output by the paths of the second acoustic WFST network from acoustic features to phonemes, as the input of the dictionary WFST network to obtain the probability of each path from phonemes to characters (or words);
  • the other steps 1001, 1003, 1005, and 1010 are the same as 801, 803, 805, and 810 in FIG. 8, and therefore are not described again.
  • the above-mentioned acoustic WFST, pronunciation WFST, dictionary WFST, and language WFST may be generated based on the corresponding construction steps in FIG. 6.
  • FIG. 11 shows a flowchart of another speech recognition decoding method according to an embodiment of the present invention. Compared with the decoding flowchart of FIG. 9, the difference is in the following steps:
  • Step 1102: Use the acoustic features as the input of the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes; with the acoustic features as input, search the paths of the second acoustic WFST network to obtain the probability of each path from acoustic features to context-dependent phonemes;
  • Step 1103 Use the pronunciation attributes output from the paths from the acoustic features to the pronunciation attributes as the input of the pronunciation WFST network to obtain the probability of each path from the pronunciation attributes to the context-dependent phonemes;
  • Step 1104: Use the context-dependent phonemes output by the paths from pronunciation attributes to context-dependent phonemes, together with the context-dependent phonemes output by the paths of the second acoustic WFST from acoustic features to context-dependent phonemes, as the input of the context WFST network in the decoder to obtain the probability of each path from context-dependent phonemes to phonemes;
  • the above-mentioned acoustic WFST, pronunciation WFST, context WFST, dictionary WFST, and language WFST may be generated based on the corresponding construction steps in FIG. 6.
  • the traditional acoustic modeling method is improved, and the pronunciation attribute features that are not affected by external interference such as noise and reverberation are added.
  • An improved decoding search method is proposed, which uses the probability that a speech frame belongs to each pronunciation attribute, together with the correlation between pronunciation attributes and phonemes, to improve the robustness of the speech recognition system to the environment.
  • A context phoneme model is introduced to combine phoneme recognition with the context model, improving the accuracy of speech recognition.
  • FIG. 3 is a structural diagram of a speech recognition decoding system according to an embodiment of the present invention, including a speech recognition WFST construction device 100 and a speech recognition decoding device 200.
  • the inventive speech recognition WFST construction device may be provided on the server 102 or the server 101 or the electronic terminal, and the speech recognition decoding device may be provided on the electronic terminal device or the server 101.
  • the speech recognition WFST construction device 100 in FIG. 3 includes: a 301 acoustic WFST generation unit, a 302 pronunciation WFST generation unit, a 303 context WFST generation unit, a 304 dictionary WFST generation unit, a 305 language WFST generation unit, and a 306 decoder generation unit.
  • the 301 acoustic WFST generation unit is used to generate an acoustic model WFST.
  • the acoustic model WFST is a search network (represented by H1) from acoustic features to pronunciation attributes.
  • the 302 pronunciation WFST generating unit is used to construct and generate the pronunciation WFST.
  • the pronunciation WFST is a search network (represented by A) from pronunciation attributes (Articulatory Features) to phonemes or context-dependent phonemes.
  • the 303 context WFST generation unit is used to generate a context WFST (represented by C).
  • The 303 context WFST can be a mapping from context-dependent phonemes to phonemes, where a context-dependent phoneme can be a triphone (recorded as phone/left phone/right phone).
  • The 303 context WFST generation unit is an optional unit in the construction device of the speech recognition decoder.
  • When the construction device of the speech recognition WFST includes the 303 context WFST generation unit, the pronunciation WFST constructed by the 302 pronunciation WFST generation unit is a search network from pronunciation attributes to context-dependent phonemes; when the construction device does not include the 303 context WFST generation unit, the pronunciation WFST constructed by the 302 pronunciation WFST generation unit is a search network from pronunciation attributes to phonemes.
  • the 304 dictionary WFST generating unit is used to generate a dictionary (Lexicon) WFST.
  • The dictionary WFST is a search network (indicated by L) from phonemes to characters (or words).
  • The 305 language WFST generation unit is used to generate a language model (Language Model) WFST.
  • The language model WFST is a search network (represented by G) from characters (or words) to word sequences.
  • The 306 speech recognition WFST generation unit is configured to perform integration, determinization, minimization, and other operations on the acoustic model WFST, the pronunciation WFST, the dictionary WFST, and the language model WFST to obtain the final speech recognition decoder WFST.
  • When the context WFST is not included, the integration process of the 306 speech recognition WFST generation unit is: the 304 dictionary WFST is composed with the 305 language model WFST, and the obtained WFST is then composed with the 302 pronunciation WFST and further with the 301 acoustic WFST. After the integration operations are completed, a weighted finite state transducer (WFST) mapping acoustic features (state probability distributions) to word sequences is obtained.
  • The integrated weighted finite state transducer is expressed as H1*A*L*G; each successful path of the state transition network generated by the final speech recognition decoder WFST represents a possible correspondence between acoustic features and a word sequence.
  • When the context WFST is included, the integrated speech recognition decoder WFST is H1*A*C*L*G.
  • The integration step is: the 304 dictionary WFST is composed with the 305 language model WFST; the obtained WFST is composed with the 303 context WFST; the result is then composed with the 302 pronunciation WFST, and finally further with the 301 acoustic WFST.
  • After the integration operations are completed, a weighted finite state transducer (WFST) mapping acoustic features (state probability distributions) to word sequences is obtained.
  • the speech recognition decoding device 200 includes a 307 signal pickup unit (such as a microphone) and a 310 decoder.
  • the 307 signal pickup unit (such as a microphone) is used to collect and obtain a voice sound wave to obtain a voice signal.
  • the 310 decoder includes: 308 signal processing and feature extraction unit and 309 speech recognition decoding unit.
  • The 308 signal processing and feature extraction unit is used to process the acoustic signals output by the signal pickup unit and extract acoustic features.
  • The 309 speech recognition decoding unit is used to search, based on the speech recognition WFST, over the acoustic features extracted by the 308 signal processing and feature extraction unit, to obtain the probability of each path from acoustic features to word sequences, and to output the recognition result (word sequence) corresponding to the path with the highest probability.
  • the speech recognition WFST is generated by the 306 speech recognition WFST generating unit mentioned above.
  • Acoustic feature extraction may include, for example, dividing the voice signal output by the signal pickup unit into multiple voice signal frames, enhancing each voice signal frame through processing such as eliminating noise and channel distortion, converting each voice signal frame from the time domain to the frequency domain, and extracting appropriate acoustic features from the converted speech signal frames.
  • FIG. 4 is a structural diagram of another speech recognition decoding system according to an embodiment of the present invention.
  • the system includes a speech recognition WFST construction device 300 and a speech recognition decoding device 400.
  • Compared with the device of FIG. 3, a 410 second acoustic model WFST generation unit is added to the component units of the speech recognition WFST construction device 300.
  • the speech recognition WFST construction device 300 includes a 410 second acoustic model WFST generating unit, a 401 acoustic WFST generating unit, a 402 pronunciation WFST generating unit, a 403 context WFST generating unit, a 404 dictionary WFST generating unit, a 405 language WFST generating unit, and a 406 decoder generating unit.
  • the 401 acoustic WFST generation unit, the 402 pronunciation WFST generation unit, the 403 context WFST generation unit, the 404 dictionary WFST generation unit, and the 405 language WFST generation unit have the same functions as those in FIG. 3 and will not be described again.
  • the 410 second acoustic model WFST generating unit is configured to generate a second acoustic model WFST.
  • the second acoustic model WFST is a search network (represented by H2) from acoustic features to phonemes or context-dependent phonemes.
  • the second acoustic WFST can be constructed by a Hidden Markov Model HMM.
  • When the construction device of the speech recognition WFST includes the 403 context WFST generation unit, the second acoustic WFST constructed by the 410 second acoustic WFST generation unit is a search network from acoustic features to context-dependent phonemes; when the construction device does not include the 403 context WFST generation unit, the second acoustic WFST constructed by the 410 second acoustic WFST generation unit is a search network from acoustic features to phonemes.
  • The 406 speech recognition WFST generation unit is used to perform integration, determinization, minimization, and other operations on the 401 acoustic model WFST, the 410 second acoustic model WFST, the 402 pronunciation WFST, the 404 dictionary WFST, the 405 language model WFST, and the 403 context WFST to obtain the final speech recognition decoder WFST.
  • Since the 403 context WFST generation unit is an optional unit, the context WFST may not be included in the integration process.
  • When the context WFST is not included, the integration process includes: the integration result of the acoustic WFST and the pronunciation WFST is network-merged with the second acoustic WFST to generate a WFST from acoustic features to phonemes; the dictionary WFST is composed with the language model WFST; the obtained finite state transducer is then composed with the network-merged WFST.
  • The integrated speech recognition decoder WFST is represented by (H1*A+H2)*L*G, and each successful path represents a possible correspondence between acoustic features and a word sequence.
  • When the context WFST is included, the integrated speech recognition decoder WFST is represented by (H1*A+H2)*C*L*G.
  • The integration process includes: the integration result of the acoustic WFST and the pronunciation WFST is network-merged with the second acoustic WFST to generate a WFST from acoustic features to context-dependent phonemes; the dictionary WFST is composed with the language model WFST, and the obtained finite state transducer is composed with the context WFST; the result of that integration is then composed with the network-merged WFST.
  • The network merge combines two WFST networks with the same input and output types: identical paths in the two networks are merged and their probabilities combined, while differing paths are kept, generating a new WFST network with the same input and output types.
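  • A simplified sketch of such a merge (hypothetical code: each WFST is reduced to a dictionary mapping a labeled arc to a probability, and identical arcs are combined here by summation, one possible combination):
```python
def network_merge(wfst_a, wfst_b, combine=lambda p, q: p + q):
    """Merge two WFSTs with the same input/output types, arc by arc."""
    merged = dict(wfst_a)
    for arc, prob in wfst_b.items():
        # identical paths are combined; paths unique to either network are kept
        merged[arc] = combine(merged[arc], prob) if arc in merged else prob
    return merged

h1_a = {("feat1", "ph_a"): 0.5, ("feat2", "ph_b"): 0.2}  # H1 composed with A
h2   = {("feat1", "ph_a"): 0.3, ("feat3", "ph_c"): 0.4}  # second acoustic WFST
print(network_merge(h1_a, h2))  # the shared arc is combined, the rest are kept
```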
  • Compared with the speech recognition decoding device 200 of the embodiment of FIG. 3, the difference in the speech recognition decoding device 400 of FIG. 4 is that the WFST constructed by the 406 speech recognition WFST generation unit and sent to the 409 speech recognition decoding unit differs from that of the embodiment of FIG. 3; the functions of the 407 signal pickup unit (such as a microphone) and the 408 signal processing and feature extraction unit are the same as those in the embodiment of FIG. 3.
  • That is, the WFST used by the 409 speech recognition decoding unit is generated by integration in the 406 speech recognition WFST generation unit, and the integration method may be an existing WFST integration method, for example, integration followed by determinization and minimization.
  • FIG. 14 is a structural diagram of a speech recognition decoding device provided by an embodiment of the present application.
  • the speech recognition decoding device includes: a 1401 speech signal receiving unit, a 1402 acoustic feature extraction unit, a 1403 first acquisition unit, a 1404 second acquisition unit, a 1405 third acquisition unit, a 1406 fourth acquisition unit, and 1410 result output unit.
  • the 1402 acoustic feature extraction unit is configured to extract an acoustic feature sequence from the voice signal received by the 1401 voice signal receiving unit;
  • the 1403 first acquisition unit is configured to sequentially input the acoustic feature sequence extracted by the 1402 acoustic feature extraction unit into the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes;
  • the 1404 second acquisition unit is configured to input the pronunciation attributes of the paths acquired by the 1403 first acquisition unit into the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to phonemes;
  • the 1405 third acquisition unit is configured to input the phonemes of the paths acquired by the 1404 second acquisition unit into the dictionary WFST network to obtain the probability of each path from phonemes to characters or words;
  • the 1406 fourth acquisition unit is configured to input the characters or words of the paths acquired by the 1405 third acquisition unit into the language WFST network to obtain the probability of each path from characters or words to word sequences;
  • the 1410 result output unit is configured to obtain, according to the probabilities of the paths obtained by the acquisition units, the total probability of each path from the acoustic feature sequence to a word sequence, and to output the word sequence corresponding to the path with the largest total probability as the recognition result of the acoustic feature sequence.
  • FIG. 15 is a structural diagram of another speech recognition and decoding device according to an embodiment of the present application.
  • The speech recognition decoding device includes: a 1501 voice signal receiving unit, a 1502 acoustic feature extraction unit, a 1503 first acquisition unit, a 1504 second acquisition unit, a 1505 third acquisition unit, a 1506 fourth acquisition unit, a 1507 fifth acquisition unit, and a 1510 result output unit.
  • the 1502 acoustic feature extraction unit is configured to extract an acoustic feature sequence from the voice signal received by the 1501 voice signal receiving unit;
  • the 1503 first acquisition unit is configured to sequentially input the acoustic feature sequence extracted by the 1502 acoustic feature extraction unit into the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes;
  • the 1504 second acquisition unit is configured to input the pronunciation attributes of the paths acquired by the 1503 first acquisition unit into the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes;
  • the 1505 third acquisition unit is configured to input the context-dependent phonemes of the paths acquired by the 1504 second acquisition unit into the context WFST network to obtain the probability of each path from context-dependent phonemes to phonemes;
  • the 1506 fourth acquisition unit is configured to input the phonemes of the paths acquired by the 1505 third acquisition unit into the dictionary WFST network to obtain the probability of each path from phonemes to characters or words;
  • the 1507 fifth acquisition unit is configured to input the characters or words of the paths acquired by the 1506 fourth acquisition unit into the language WFST network to obtain the probability of each path from characters or words to word sequences;
  • the 1510 result output unit is configured to obtain, according to the probabilities of the paths obtained by the acquisition units, the total probability of each path from the acoustic feature sequence to a word sequence, and to output the word sequence corresponding to the path with the largest total probability as the recognition result of the acoustic feature sequence.
  • FIG. 16 is a structural diagram of another speech recognition and decoding device according to an embodiment of the present application.
  • The speech recognition decoding device includes: a 1601 voice signal receiving unit, a 1602 acoustic feature extraction unit, a 1603 first acquisition unit, a 1604 second acquisition unit, a 1605 third acquisition unit, a 1606 fourth acquisition unit, a 1607 fifth acquisition unit, and a 1610 result output unit.
  • the 1601 voice signal receiving unit is configured to receive a voice signal;
  • the 1602 acoustic feature extraction unit is configured to extract an acoustic feature sequence from the voice signal received by the 1601 voice signal receiving unit;
  • the 1603 first acquisition unit is configured to sequentially input the acoustic feature sequence extracted by the 1602 acoustic feature extraction unit into the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes;
  • the 1604 second acquisition unit takes the acoustic feature sequence extracted by the 1602 acoustic feature extraction unit as the input of the second acoustic WFST network to obtain the probability of each path from the acoustic feature sequence to phonemes;
  • the 1605 third acquisition unit inputs the pronunciation attributes output by the paths obtained by the 1603 first acquisition unit into the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to phonemes;
  • the 1606 fourth acquisition unit inputs the phonemes output by the paths obtained by the 1604 second acquisition unit and the phonemes output by the paths obtained by the 1605 third acquisition unit into the dictionary WFST network to obtain the probability of each path from phonemes to characters or words;
  • the 1607 fifth acquisition unit inputs the characters or words output by the paths obtained by the 1606 fourth acquisition unit into the language WFST network to obtain the probability of each path from characters or words to word sequences;
  • the 1610 result output unit is configured to obtain, according to the probabilities of the paths obtained by the acquisition units, the total probability of each path from the acoustic feature sequence to a word sequence, and to output the word sequence corresponding to the path with the largest total probability as the recognition result of the acoustic feature sequence.
  • FIG. 17 is a structural diagram of another speech recognition and decoding device according to an embodiment of the present application.
  • The speech recognition decoding device includes: a 1701 voice signal receiving unit, a 1702 acoustic feature extraction unit, a 1703 first acquisition unit, a 1704 second acquisition unit, a 1705 third acquisition unit, a 1706 fourth acquisition unit, a 1707 fifth acquisition unit, a 1708 sixth acquisition unit, and a 1710 result output unit.
  • the 1702 acoustic feature extraction unit is configured to extract an acoustic feature sequence from the voice signal received by the 1701 voice signal receiving unit;
  • the 1703 first acquisition unit is configured to sequentially input the acoustic feature sequence extracted by the 1702 acoustic feature extraction unit into the acoustic WFST network to obtain the probability of each path from acoustic features to pronunciation attributes;
  • the 1704 second acquisition unit takes the acoustic feature sequence extracted by the 1702 acoustic feature extraction unit as the input of the second acoustic WFST network to obtain the probability of each path from the acoustic feature sequence to context-dependent phonemes;
  • the 1705 third acquisition unit inputs the pronunciation attributes output by the paths obtained by the 1703 first acquisition unit into the pronunciation WFST network to obtain the probability of each path from pronunciation attributes to context-dependent phonemes;
  • the 1706 fourth acquisition unit inputs the context-dependent phonemes output by the paths obtained by the 1704 second acquisition unit and the context-dependent phonemes output by the paths obtained by the 1705 third acquisition unit into the context WFST network to obtain the probability of each path from context-dependent phonemes to phonemes;
  • the 1707 fifth acquisition unit inputs the phonemes output by the paths obtained by the 1706 fourth acquisition unit into the dictionary WFST network to obtain the probability of each path from phonemes to characters or words;
  • the 1708 sixth acquisition unit inputs the characters or words output by the paths obtained by the 1707 fifth acquisition unit into the language WFST network to obtain the probability of each path from characters or words to word sequences;
  • the 1710 result output unit is configured to obtain, according to the probabilities of the paths obtained by the acquisition units, the total probability of each path from the acoustic feature sequence to a word sequence, and to output the word sequence corresponding to the path with the largest total probability as the recognition result of the acoustic feature sequence.
  • FIG. 12 is a schematic structural diagram of a server according to an embodiment of the present invention.
  • the server 1208 shown in FIG. 12 is merely an example, and should not impose any limitation on the function and scope of use of the embodiment of the present invention.
  • The server 1208 is shown in the form of a general-purpose computing device.
  • the components of the server 1208 may include: one or more device processors 1201, a memory 1202, and a bus 1204 connecting different system components (including the memory 1202 and the device processor 1201).
  • The bus 1204 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus architectures, for example, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, or a Peripheral Component Interconnect (PCI) bus.
  • the server 1208 typically includes a variety of computer system-readable media. These media can be any available media that can be accessed by the server 1208, including volatile and non-volatile media, removable and non-removable media.
  • The memory 1202 may include a computer-system-readable medium in the form of volatile memory, such as random access memory (RAM) 1211 and/or cache memory 1212.
  • the server 1208 may further include other removable / non-removable, volatile / nonvolatile computer system storage media.
  • the storage system 1213 may be used to read and write non-removable, non-volatile magnetic media (commonly referred to as a "hard drive").
  • each drive may be connected to the bus 1204 through one or more data medium interfaces.
  • the memory 1202 may include at least one program product having a set (for example, at least one) of program modules 1214 configured to perform the functions of the speech recognition decoding method in a specific embodiment of the present invention.
  • Program modules 1214 may be stored, for example, in the memory 1202. Such program modules 1214 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • the program module 1214 generally performs functions and / or methods in the embodiments described in the present invention.
  • The server 1208 may also communicate with one or more external devices 1206 (such as a keyboard, a pointing device, a display, etc.), with one or more devices that enable users to interact with the server 1208, and/or with any device (such as a network card, a modem, etc.) that enables the server 1208 to communicate with one or more other computing devices. Such communication can take place through the user interface 1205. Moreover, the server 1208 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the communication module 1203. As shown, the communication module 1203 communicates with the other modules of the server 1208 through the bus 1204.
  • Other hardware and/or software modules may be used in conjunction with the server 1208, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
  • The device processor 1201 executes various functional applications and performs data processing by running programs stored in the memory 1202, for example:
  • The processor 1201 may be used to call a program stored in the memory 1202, for example, a server-side implementation program of the speech recognition decoding methods provided by one or more embodiments of the present application, as shown in FIGS. 7-11, or a server-side implementation program of the speech recognition WFST construction methods provided by one or more embodiments of the present application, as shown in FIGS. 5-6, and to execute the instructions contained in the program.
  • Acoustic WFST is a search network from acoustic features to pronunciation attributes, such as Hidden Markov Model (HMM) WFST.
  • the pronunciation WFST is a search network from pronunciation attributes (Articulatory Features) to phonemes or context-dependent phonemes.
  • The dictionary WFST is a search network from phonemes to characters or words.
  • The language WFST is a search network from characters or words to word sequences.
  • the processor in the server of FIG. 12 may also be used to execute a server-side implementation program of the speech recognition WFST construction method of FIG. 6.
  • the processor in the server in FIG. 12 may also be used to execute a server-side implementation program of one or more of the speech recognition and decoding methods in FIG. 7-11.
  • FIG. 13 shows a schematic structural diagram of an electronic terminal according to an embodiment of the present invention.
  • The electronic terminal 1300 may take various forms of mobile terminal, including a mobile phone, a tablet, a personal digital assistant (PDA), a vehicle-mounted terminal, a wearable device, a smart terminal, and the like.
  • the electronic terminal 1300 shown in FIG. 13 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
  • The electronic terminal 1300 includes an RF circuit 1301, a Wi-Fi module 1302, a display unit 1303, an input unit 1304, a first memory 1305, a second memory 1306, a processor 1307, a power supply 1308, a GPS module 1309, and other hardware modules.
  • the RF circuit 1301 is used to send and receive communication signals, and can perform data interaction with other network devices through a wireless network.
  • the communication module 1302 may be a Wi-Fi module, which is used for communication interconnection through a Wi-Fi connection network. It can also be a Bluetooth module or other short-range wireless communication module.
  • the display unit 1303 is used to display a user interaction interface, and the user can access the mobile application through the display interface.
  • the display unit 1303 may include a display panel.
  • the display panel may be configured by using an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
  • the touch panel covers the display panel to form a touch display screen, and the processor 1307 provides a corresponding visual output on the touch display screen according to the type of touch instruction.
  • The input unit 1304 may include a touch panel, also referred to as a touch screen, which can collect the user's touch operations on or near it (such as operations performed on the touch panel by the user with a finger, a stylus, or any suitable object or accessory). The touch panel may be implemented in a variety of types, such as resistive, capacitive, infrared, and surface acoustic wave.
  • The input unit 1304 may also include other input devices, including but not limited to one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
  • the first memory 1305 stores a preset number of APPs and interface information of the device. It is understandable that the second memory 1306 may be external storage of the electronic terminal 1300, and the first memory 1305 may be a memory of the smart device.
  • The first memory 1305 may be one of non-volatile memory (NVRAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, and the like; the operating system running on the smart device is usually installed on the first memory 1305.
  • the second storage 1306 may be a hard disk, an optical disk, a USB disk, a floppy disk or a tape drive, a cloud server, or the like.
  • the speech recognition decoding program or the speech recognition WFST construction program in the specific embodiment of the present invention may be stored in the first memory 1305, or may be stored in the second memory 1306.
  • The processor 1307 is the control center of the device; it uses various interfaces and lines to connect the parts of the entire device, runs or executes the software programs and/or modules stored in the first memory 1305, and calls the data stored in the second memory 1306 to perform the various functions of the device and process data.
  • the processor 1307 may include one or more processing units.
  • The power supply 1308, which may include various types of lithium batteries, can power the entire device.
  • the GPS module 1309 is configured to obtain position information of a user, such as position coordinates.
  • The first memory 1305 or the second memory 1306 may be used to store the terminal-side implementation program of the speech recognition decoding methods provided by one or more embodiments of the present application, as shown in FIGS. 7-11, or the terminal-side implementation program of the speech recognition WFST construction methods provided by one or more embodiments of the present application, as shown in FIGS. 5-6.
  • For the speech recognition decoding methods provided by one or more embodiments of the present application, please refer to the embodiments of FIGS. 5-11.
  • The processor 1307 may be used to read and execute computer-readable instructions. Specifically, the processor 1307 may be configured to call a program stored in the first memory 1305 or the second memory 1306, for example, the terminal-side implementation program of the speech recognition decoding methods provided by one or more embodiments of the present application, or the terminal-side implementation program of the speech recognition WFST construction methods provided by one or more embodiments of the present application, and to execute the instructions included in the program.
  • The received voice signal to be recognized can be cut into multiple voice signal frames.
  • The decoding and recognition process first performs acoustic feature extraction on the voice signal.
  • the device processor of FIG. 13 can also be used to execute the method of constructing the speech recognition WFST of FIG. 5-6, and the implementation procedures of the speech recognition decoding method of FIGS. 7, 9-11 on the electronic terminal side.
  • the processor may also be implemented in the form of a chip.
  • The above device improves the traditional speech recognition decoder by adding pronunciation attribute features that are not disturbed by external interference such as noise and reverberation.
  • An improved decoding search method is proposed, which uses the probability that a speech frame belongs to each pronunciation attribute, together with the correlation between pronunciation attributes and phonemes, to improve the robustness of the speech recognition system to the environment.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.

Abstract

A speech recognition method, device, and system. The method includes: constructing an acoustic WFST, the acoustic WFST being a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST, the pronunciation WFST being a search network from pronunciation attributes to phonemes; constructing a dictionary WFST, the dictionary WFST being a search network from phonemes to characters or words; constructing a language WFST, the language WFST being a search network from characters or words to word sequences; integrating the above WFSTs to generate a speech recognition WFST; and performing speech recognition based on the speech recognition WFST and outputting the recognition result with the highest probability.

Description

Summary of the Invention
In order to improve the accuracy of speech recognition, the embodiments of the present application provide a speech recognition decoding method, system, and device, as well as a corresponding construction method, system, and device for a speech recognition weighted finite state transducer (WFST).
In one aspect, the embodiments of the present application provide a method for constructing a speech recognition WFST, the method including: constructing an acoustic WFST (H1), the acoustic WFST being a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST (A), the pronunciation WFST being a search network from pronunciation attributes to phonemes; constructing a dictionary WFST (L), the dictionary WFST being a search network from phonemes to characters or words; constructing a language WFST (G), the language WFST being a search network from characters or words to word sequences; and integrating a plurality of WFSTs to generate the speech recognition WFST, where the plurality of WFSTs include the acoustic WFST, the pronunciation WFST, the dictionary WFST, and the language WFST. The speech recognition WFST generated by the integration is a search network from acoustic features to word sequences, expressed as H1*A*L*G. The integration is: the dictionary WFST is composed with the language model WFST; the resulting finite state transducer is then composed with the pronunciation WFST and further with the acoustic WFST.
Optionally, in the above method for constructing a speech recognition WFST, constructing the acoustic weighted finite state transducer WFST (H1) is: taking pronunciation attributes as states and acoustic features as observations, and using an HMM (Hidden Markov Model) combined with the forward-backward algorithm, the expectation-maximization algorithm, and the Viterbi algorithm to obtain the probability of producing given acoustic features under the condition of a pronunciation attribute, the acoustic WFST being constructed based on the probability.
Optionally, in the above method for constructing a speech recognition WFST, constructing the pronunciation WFST (A) is: a deep neural network takes acoustic features as input and phonemes and pronunciation attributes as dual-target outputs, yielding the phoneme and pronunciation attribute with the largest probability, which is counted as one co-occurrence of the pronunciation attribute and the phoneme; over the inputs and outputs of a large speech corpus, the number of co-occurrences of each pronunciation attribute and phoneme is counted and divided by the total number of frames to obtain their co-occurrence probability; the pronunciation attributes, phonemes, and their co-occurrence probabilities are represented as the pronunciation WFST, where the input of a state transition of the pronunciation WFST is a pronunciation attribute and the output is a phoneme together with the co-occurrence probability of the pronunciation attribute and the phoneme.
Optionally, a second acoustic WFST (H2) is constructed, the second acoustic WFST being a search network from acoustic features to phonemes, and the plurality of WFSTs to be integrated further include the second acoustic WFST. When the plurality of WFSTs being integrated include the second acoustic WFST, the integrated weighted finite state transducer is (H1*A+H2)*L*G.
Optionally, when the plurality of WFSTs being integrated include the second acoustic WFST (H2), the integration step is: the acoustic-feature-to-phoneme WFST obtained by integrating the acoustic WFST and the pronunciation WFST is network-merged with the second acoustic WFST to generate one acoustic-feature-to-phoneme WFST; the dictionary WFST is then integrated with the language WFST, and the resulting finite state transducer is integrated with the above network-merged acoustic-feature-to-phoneme WFST to generate the speech recognition WFST.
Optionally, the network merge merges the identical paths of two WFSTs having the same input and output types and combines their probabilities, while differing paths are kept.
Optionally, the integration process further includes determinization and minimization.
In one aspect, the embodiments of the present application provide a method for constructing a speech recognition WFST, the method including: constructing an acoustic weighted finite state transducer WFST (H1), the acoustic WFST being a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST (A), the pronunciation WFST being a search network from pronunciation attributes to context-dependent phonemes; constructing a context WFST (C), the context WFST being a search network from context-dependent phonemes to phonemes; constructing a dictionary WFST (L), the dictionary WFST being a search network from phonemes to characters or words; constructing a language WFST (G), the language WFST being a search network from characters or words to word sequences; and integrating a plurality of WFSTs to generate the speech recognition WFST, where the plurality of WFSTs include the acoustic WFST, the pronunciation WFST, the context WFST, the dictionary WFST, and the language WFST. The speech recognition WFST generated by the integration is a search network from acoustic features to word sequences, expressed as H1*A*C*L*G.
Optionally, the integration step is specifically: the dictionary WFST is composed with the language model WFST; the resulting finite state transducer is composed with the context WFST; the result is then composed with the pronunciation WFST and further integrated with the acoustic WFST.
Optionally, in the above method for constructing a speech recognition WFST, constructing the acoustic weighted finite state transducer WFST (H1) is: taking pronunciation attributes as states and acoustic features as the observation sequence, and using an HMM combined with the forward-backward algorithm, the expectation-maximization algorithm, and the Viterbi algorithm to obtain the probability of producing a given observation (acoustic feature) under the condition of the pronunciation attribute, the acoustic WFST being constructed based on the probability.
Optionally, in the above method for constructing a speech recognition WFST, constructing the pronunciation WFST (A) is: a deep neural network takes acoustic features as input and phonemes and pronunciation attributes as dual-target outputs, yielding the phoneme and pronunciation attribute with the largest probability, which is counted as one co-occurrence of the pronunciation attribute and the phoneme; over the inputs and outputs of a large speech corpus, the number of co-occurrences of each pronunciation attribute and phoneme is counted and divided by the total number of frames to obtain their co-occurrence probability; the pronunciation attributes, phonemes, and their co-occurrence probabilities are represented as the pronunciation WFST, where the input of a state transition of the pronunciation WFST is a pronunciation attribute and the output is a phoneme together with the co-occurrence probability of the pronunciation attribute and the phoneme.
Optionally, the method further includes: constructing a second acoustic WFST, the second acoustic WFST being a search network from acoustic features to context-dependent phonemes.
Optionally, when the plurality of WFSTs being integrated include the second acoustic WFST, the integrated weighted finite state transducer is (H1*A+H2)*C*L*G.
Optionally, when the plurality of WFSTs being integrated include the second acoustic WFST (H2), the integration step is: the acoustic-feature-to-context-dependent-phoneme WFST obtained by integrating the acoustic WFST and the pronunciation WFST is network-merged with the second acoustic WFST to generate one WFST from acoustic features to context-dependent phonemes; the dictionary WFST is then integrated with the language WFST; the resulting finite state transducer is integrated with the context WFST, and that result is integrated with the above network-merged WFST to generate the speech recognition WFST.
Optionally, the network merge merges the identical paths of two WFSTs having the same input and output types and combines their probabilities, while differing paths are kept.
Optionally, the integration further includes determinization and minimization.
In another aspect, the embodiments of the present application further provide a speech recognition decoding method, the method including: receiving a voice signal; extracting acoustic features from the voice signal; inputting the acoustic features into a speech recognition WFST and obtaining the probability of each path from acoustic features to word sequences; and comparing the probabilities of the paths and outputting the word sequence corresponding to the path with the largest probability as the recognition result.
Optionally, the speech recognition WFST is a search network from acoustic features to word sequences generated by integrating an acoustic WFST, a pronunciation WFST, a context WFST, a dictionary WFST, and a language WFST.
Optionally, the acoustic WFST is a search network from acoustic features to pronunciation attributes; the pronunciation WFST is a search network from pronunciation attributes to context-dependent phonemes; the context WFST is a search network from context-dependent phonemes to phonemes; the dictionary WFST is a search network from phonemes to characters or words; and the language WFST is a search network from characters or words to word sequences.
Optionally, the speech recognition WFST is a search network from acoustic features to word sequences generated by integrating an acoustic WFST, a pronunciation WFST, a dictionary WFST, and a language WFST.
Optionally, the acoustic WFST is a search network from acoustic features to pronunciation attributes; the pronunciation WFST is a search network from pronunciation attributes to phonemes; the dictionary WFST is a search network from phonemes to characters or words; and the language WFST is a search network from characters or words to word sequences.
In another aspect, the embodiments of the present application further provide a speech recognition decoding method, the method including: receiving a voice signal; extracting an acoustic feature sequence from the voice signal; sequentially inputting the acoustic feature sequence into an acoustic WFST network and obtaining the probability of each path from acoustic features to pronunciation attributes; taking the pronunciation attributes output by the paths from acoustic features to pronunciation attributes as the input of a pronunciation WFST network and obtaining the probability of each path from pronunciation attributes to context-dependent phonemes; taking the context-dependent phonemes output by the paths from pronunciation attributes to context-dependent phonemes as the input of a context WFST network and obtaining the probability of each path from context-dependent phonemes to phonemes; taking the phonemes output by the paths from context-dependent phonemes to phonemes as the input of a dictionary WFST network and obtaining the probability of each path from phonemes to characters or words; taking the characters or words output by the paths from phonemes to characters or words as the input of a language WFST network and obtaining the probability of each path from characters or words to word sequences; and obtaining, according to the probabilities of the paths in the WFST networks, the total probability of each path from the acoustic feature sequence to a word sequence, and outputting the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
Optionally, in the speech recognition decoding method, the total probability is calculated by a summation or product operation.
In another aspect, the embodiments of the present application further provide a speech recognition decoding method, the method including: receiving a voice signal; extracting an acoustic feature sequence from the voice signal; sequentially inputting the acoustic feature sequence into an acoustic WFST network and obtaining the probability of each path from acoustic features to pronunciation attributes; taking the acoustic features as the input of a second acoustic WFST network and obtaining the probability of each path from acoustic features to context-dependent phonemes; taking the pronunciation attributes output by the paths from acoustic features to pronunciation attributes as the input of a pronunciation WFST network and obtaining the probability of each path from pronunciation attributes to context-dependent phonemes; taking the context-dependent phonemes output by the paths from pronunciation attributes to context-dependent phonemes, together with the context-dependent phonemes output by the second acoustic WFST network, as the input of the context WFST network and obtaining the probability of each path from context-dependent phonemes to phonemes; taking the phonemes output by the paths from context-dependent phonemes to phonemes as the input of a dictionary WFST network and obtaining the probability of each path from phonemes to characters or words; taking the characters or words output by the paths from phonemes to characters or words as the input of a language WFST network and obtaining the probability of each path from characters or words to word sequences; and obtaining, according to the probabilities of the paths in the WFST networks, the total probability of each path from the acoustic feature sequence to a word sequence, and outputting the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
Optionally, in the speech recognition decoding method, the total probability is calculated by a summation or product operation.
In another aspect, the embodiments of the present application further provide a device for constructing a speech recognition WFST, the device including a processor, where the processor is configured to be coupled to a memory and to read and execute instructions in the memory, the instructions including: constructing an acoustic weighted finite state transducer WFST, the acoustic WFST being a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST, the pronunciation WFST being a search network from pronunciation attributes to phonemes; constructing a dictionary WFST, the dictionary WFST being a search network from phonemes to characters or words; constructing a language WFST, the language WFST being a search network from characters or words to word sequences; and integrating a plurality of WFSTs to generate the speech recognition WFST, where the plurality of WFSTs include the acoustic WFST, the pronunciation WFST, the dictionary WFST, and the language WFST, the speech recognition WFST generated by the integration being a search network from acoustic features to word sequences.
Optionally, the instructions further include: constructing a second acoustic WFST, the second acoustic WFST being a search network from acoustic features to phonemes; and the plurality of WFSTs integrated to generate the speech recognition WFST further include the second acoustic WFST.
In another aspect, the embodiments of the present application further provide a device for constructing a speech recognition WFST, the device including a processor, where the processor is configured to be coupled to a memory and to read and execute instructions in the memory, the instructions including: constructing an acoustic weighted finite state transducer WFST, the acoustic WFST being a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST, the pronunciation WFST being a search network from pronunciation attributes to context-dependent phonemes; constructing a context WFST, the context WFST being a search network from context-dependent phonemes to phonemes; constructing a dictionary WFST, the dictionary WFST being a search network from phonemes to characters or words; constructing a language WFST, the language WFST being a search network from characters or words to word sequences; and integrating a plurality of WFSTs to generate the speech recognition WFST, where the plurality of WFSTs include the acoustic WFST, the pronunciation WFST, the context WFST, the dictionary WFST, and the language WFST, the speech recognition WFST generated by the integration being a search network from acoustic features to word sequences.
Optionally, the instructions further include: constructing a second acoustic WFST, the second acoustic WFST being a search network from acoustic features to context-dependent phonemes; and the plurality of WFSTs integrated to generate the speech recognition WFST further include the second acoustic WFST.
The WFST integration manner in the above speech recognition WFST construction devices is the same as in the embodiments related to the above speech recognition WFST construction methods.
In another aspect, the embodiments of the present application further provide a speech recognition decoding device, the device including a processor, where the processor is configured to be coupled to a memory and to read and execute instructions in the memory, the instructions including: receiving a voice signal; extracting an acoustic feature sequence from the voice signal; sequentially inputting the acoustic feature sequence into an acoustic WFST network and obtaining the probability of each path from acoustic features to pronunciation attributes; taking the pronunciation attributes output by the paths from acoustic features to pronunciation attributes as the input of a pronunciation WFST network and obtaining the probability of each path from pronunciation attributes to phonemes; taking the phonemes output by the paths from pronunciation attributes to phonemes as the input of a dictionary WFST network and obtaining the probability of each path from phonemes to characters or words; taking the characters or words output by the paths from phonemes to characters or words as the input of a language WFST network and obtaining the probability of each path from characters or words to word sequences; and obtaining, according to the probabilities of the paths in the WFST networks, the total probability of each path from the acoustic feature sequence to a word sequence, and outputting the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
Optionally, the obtained paths refer to active paths, where an active path is a path with a relatively large probability that remains after paths with smaller probabilities have been pruned during the WFST search.
In another aspect, the embodiments of the present application further provide a speech recognition decoding device, the device including a processor, where the processor is configured to be coupled to a memory and to read and execute instructions in the memory, the instructions including: receiving a voice signal; extracting an acoustic feature sequence from the voice signal; sequentially inputting the acoustic feature sequence into an acoustic WFST network and obtaining the probability of each path from acoustic features to pronunciation attributes; taking the acoustic feature sequence as the input of a second acoustic WFST network and obtaining the probability of each path from acoustic features to phonemes; taking the pronunciation attributes output by the paths from acoustic features to pronunciation attributes as the input of a pronunciation WFST network and obtaining the probability of each path from pronunciation attributes to phonemes; taking the phonemes output by the paths from pronunciation attributes to phonemes and the phonemes output by the second acoustic WFST network as the input of the dictionary WFST network and obtaining the probability of each path from phonemes to characters or words; taking the characters or words output by the paths from phonemes to characters or words as the input of a language WFST network and obtaining the probability of each path from characters or words to word sequences; and obtaining, according to the probabilities of the paths in the WFST networks, the total probability of each path from the acoustic feature sequence to a word sequence, and outputting the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
Optionally, the obtained paths refer to active paths, where an active path is a path with a relatively large probability that remains after paths with smaller probabilities have been pruned during the WFST search.
In another aspect, the embodiments of the present application further provide a speech recognition decoding system, the system including a terminal and a server. The terminal is configured to receive a voice signal and send the voice signal to the server. The server is configured to receive the voice signal, extract an acoustic feature sequence from the voice signal, input the acoustic feature sequence into a speech recognition WFST, obtain the probability of each path from the acoustic feature sequence to word sequences, compare the probabilities of the paths, and output the word sequence corresponding to the path with the largest probability as the recognition result.
Optionally, the speech recognition WFST is a search network from acoustic features to word sequences generated by integrating an acoustic WFST, a pronunciation WFST, a context WFST, a dictionary WFST, and a language WFST.
Optionally, the acoustic WFST is a search network from acoustic features to pronunciation attributes; the pronunciation WFST is a search network from pronunciation attributes to context-dependent phonemes; the context WFST is a search network from context-dependent phonemes to phonemes; the dictionary WFST is a search network from phonemes to characters or words; and the language WFST is a search network from characters or words to word sequences.
Optionally, the speech recognition WFST is a search network from acoustic features to word sequences generated by integrating an acoustic WFST, a pronunciation WFST, a dictionary WFST, and a language WFST.
Optionally, the acoustic WFST is a search network from acoustic features to pronunciation attributes; the pronunciation WFST is a search network from pronunciation attributes to phonemes; the dictionary WFST is a search network from phonemes to characters or words; and the language WFST is a search network from characters or words to word sequences.
The WFST integration manner in the above speech recognition decoding system is the same as in the embodiments related to the above speech recognition WFST construction methods.
In another aspect, the embodiments of the present application further provide a speech recognition decoding system, the system including a terminal and a server. The terminal is configured to receive a voice signal and send the voice signal to the server. The server is configured to receive the voice signal and extract an acoustic feature sequence from the voice signal;
the server then sequentially inputs the acoustic feature sequence into an acoustic WFST network and obtains the probability of each path from acoustic features to pronunciation attributes; takes the pronunciation attributes output by the paths from acoustic features to pronunciation attributes as the input of a pronunciation WFST network and obtains the probability of each path from pronunciation attributes to context-dependent phonemes; takes the context-dependent phonemes output by the paths from pronunciation attributes to context-dependent phonemes as the input of a context WFST and obtains the probability of each path from context-dependent phonemes to phonemes; takes the phonemes output by the paths from context-dependent phonemes to phonemes as the input of a dictionary WFST network and obtains the probability of each path from phonemes to characters or words; takes the characters or words output by the paths from phonemes to characters or words as the input of a language WFST network and obtains the probability of each path from characters or words to word sequences; and obtains, according to the probabilities of the paths in the WFST networks, the total probability of each path from the acoustic feature sequence to a word sequence, and outputs the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
Optionally, the paths obtained in the above steps refer to active paths, where an active path is a path with a relatively large probability that remains after paths with smaller probabilities have been pruned during the WFST search.
In another aspect, the embodiments of the present application further provide a speech recognition decoding system, the system including a terminal and a server.
In another aspect, the embodiments of the present application further provide a speech recognition decoding system, the system including a terminal and a server. The terminal is configured to receive a voice signal and send the voice signal to the server. The server is configured to receive the voice signal, extract an acoustic feature sequence from the voice signal, sequentially take the acoustic feature sequence as the input of an acoustic WFST network, and obtain the probability of each path from acoustic features to pronunciation attributes; take the acoustic features as the input of a second acoustic WFST network and obtain the probability of each path from acoustic features to context-dependent phonemes; take the pronunciation attributes output by the paths from acoustic features to pronunciation attributes as the input of a pronunciation WFST network and obtain the probability of each path from pronunciation attributes to context-dependent phonemes; take the context-dependent phonemes output by the paths from pronunciation attributes to context-dependent phonemes, together with the context-dependent phonemes output by the second acoustic WFST network, as the input of a context WFST network and obtain the probability of each path from context-dependent phonemes to phonemes; take the phonemes output by the paths from context-dependent phonemes to phonemes as the input of a dictionary WFST network and obtain the probability of each path from phonemes to characters or words; take the characters or words output by the paths from phonemes to characters or words as the input of a language WFST network and obtain the probability of each path from characters or words to word sequences; and obtain, according to the probabilities of the paths in the WFST networks, the total probability of each path from the acoustic feature sequence to a word sequence, and output the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
Optionally, the paths obtained in the above steps refer to active paths, where an active path is a path with a relatively large probability that remains after paths with smaller probabilities have been pruned during the WFST search.
In another aspect, the embodiments of the present application further provide a method for constructing an acoustic WFST, the method including: using an HMM with pronunciation attributes as states and acoustic features as observations, obtaining the probability of producing given acoustic features under the condition of a pronunciation attribute, and constructing the acoustic WFST based on the probability.
Optionally, obtaining, through the HMM with pronunciation attributes as states and acoustic features as observations, the probability of producing given acoustic features under the condition of a pronunciation attribute is further: taking pronunciation attributes as states and acoustic features as observations, and using the HMM combined with the forward-backward algorithm, the expectation-maximization algorithm, and the Viterbi algorithm to obtain the probability of producing the given acoustic features under the condition of the pronunciation attribute, the acoustic WFST being constructed based on the probability.
In another aspect, the embodiments of the present application further provide a method for constructing a pronunciation WFST, the method including: performing multi-target neural network training with acoustic features as input and with pronunciation attributes and phonemes or context-dependent phonemes as dual-target outputs, and finally obtaining the co-occurrence probabilities of pronunciation attributes and phonemes or context-dependent phonemes to construct the pronunciation WFST.
In another aspect, the embodiments of the present application further provide a speech recognition decoding device, the device including: a voice signal receiving unit configured to receive a voice signal; an acoustic feature extraction unit configured to extract an acoustic feature sequence from the voice signal received by the voice signal receiving unit; a first acquisition unit configured to sequentially input the acoustic feature sequence extracted by the acoustic feature extraction unit into an acoustic WFST network and obtain the probability of each path from acoustic features to pronunciation attributes; a second acquisition unit configured to input the pronunciation attributes of the paths acquired by the first acquisition unit into a pronunciation WFST network and obtain the probability of each path from pronunciation attributes to context-dependent phonemes; a third acquisition unit configured to input the context-dependent phonemes of the paths acquired by the second acquisition unit into a context WFST network and obtain the probability of each path from context-dependent phonemes to phonemes; a fourth acquisition unit configured to input the phonemes of the paths acquired by the third acquisition unit into a dictionary WFST network and obtain the probability of each path from phonemes to characters or words; a fifth acquisition unit configured to input the characters or words of the paths acquired by the fourth acquisition unit into a language WFST network and obtain the probability of each path from characters or words to word sequences; and a result output unit configured to obtain, according to the probabilities of the paths acquired by the acquisition units, the total probability of each path from the acoustic feature sequence to a word sequence, and to output the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
Optionally, the paths obtained in the above steps refer to active paths, where an active path is a path with a relatively large probability that remains after paths with smaller probabilities have been pruned during the WFST search.
In another aspect, the embodiments of the present application further provide a speech recognition decoding device, the device including: a voice signal receiving unit configured to receive a voice signal; an acoustic feature extraction unit configured to extract an acoustic feature sequence from the voice signal received by the voice signal receiving unit; a first acquisition unit configured to sequentially input the acoustic feature sequence extracted by the acoustic feature extraction unit into an acoustic WFST network and obtain the probability of each path from acoustic features to pronunciation attributes; a second acquisition unit configured to take the acoustic feature sequence as the input of a second acoustic WFST network and obtain the probability of each path from the acoustic feature sequence to context-dependent phonemes; a third acquisition unit configured to input the pronunciation attributes output by the paths acquired by the first acquisition unit into a pronunciation WFST network and obtain the probability of each path from pronunciation attributes to context-dependent phonemes; a fourth acquisition unit configured to input the context-dependent phonemes output by the paths acquired by the second acquisition unit and the context-dependent phonemes output by the paths acquired by the third acquisition unit into a context WFST network and obtain the probability of each path from context-dependent phonemes to phonemes; a fifth acquisition unit configured to input the phonemes output by the paths acquired by the fourth acquisition unit into a dictionary WFST network and obtain the probability of each path from phonemes to characters or words; a sixth acquisition unit configured to input the characters or words output by the paths acquired by the fifth acquisition unit into a language WFST network and obtain the probability of each path from characters or words to word sequences; and a result output unit configured to obtain, according to the probabilities of the paths acquired by the acquisition units, the total probability of each path from the acoustic feature sequence to a word sequence, and to output the word sequence corresponding to the path with the largest total probability as the recognition result corresponding to the acoustic feature sequence.
Optionally, the paths obtained in the above steps refer to active paths, where an active path is a path with a relatively large probability that remains after paths with smaller probabilities have been pruned during the WFST search.
The above embodiments of the present application add a WFST from acoustic features to pronunciation attributes and a WFST from pronunciation attributes to phonemes to obtain a new speech recognition WFST, thereby introducing into the speech recognition decoding process pronunciation attribute features that are not disturbed by external interference such as noise and reverberation, improving the robustness of the speech recognition system to the environment and thus the accuracy of speech recognition.
Brief Description of the Drawings
FIG. 1a shows an example of a WFST in an embodiment of the present invention;
FIG. 1b shows an example of a WFST in an embodiment of the present invention;
FIG. 1c shows an example of the result of composing the WFSTs in FIGS. 1a and 1b;
FIG. 2 shows a diagram of a speech recognition decoding system according to an embodiment of the present invention;
FIG. 3 shows a structural diagram of a speech recognition decoding system according to an embodiment of the present invention;
FIG. 4 shows a structural diagram of another speech recognition decoding system according to an embodiment of the present invention;
FIG. 5 shows a flowchart of constructing a speech recognition WFST according to an embodiment of the present invention;
FIG. 6 shows another flowchart of constructing a speech recognition WFST according to an embodiment of the present invention;
FIG. 7 shows a speech recognition decoding flow according to an embodiment of the present invention;
FIG. 8 shows another speech recognition decoding flow according to an embodiment of the present invention;
FIG. 9 shows yet another speech recognition decoding flow according to an embodiment of the present invention;
FIG. 10 shows yet another speech recognition decoding flow according to an embodiment of the present invention;
FIG. 11 shows yet another speech recognition decoding flow according to an embodiment of the present invention;
FIG. 12 shows a structural diagram of a server according to an embodiment of the present invention;
FIG. 13 shows a structural diagram of an electronic terminal according to an embodiment of the present invention;
FIG. 14 shows a structural diagram of a speech recognition decoding device according to an embodiment of the present invention;
FIG. 15 shows a structural diagram of yet another speech recognition decoding device according to an embodiment of the present invention;
FIG. 16 shows a structural diagram of yet another speech recognition decoding device according to an embodiment of the present invention;
FIG. 17 shows a structural diagram of yet another speech recognition decoding device according to an embodiment of the present invention.
Detailed Description
For ease of understanding, some concepts used in the embodiments of the present invention are first introduced below.
The speech recognition decoder in the embodiments of the present invention is constructed by means of a speech recognition WFST. A WFST is a weighted finite state transducer used for large-scale speech recognition, in which every state transition is labeled with an input symbol and an output symbol. The constructed network (WFST) is therefore used to generate a mapping from an input symbol sequence, or string, to an output string. In addition to input and output symbols, the WFST also weights the state transitions. The weight value may be an encoded probability, a duration, or any other quantity accumulated along a path, used to compute the overall weight of mapping an input string to an output string. A WFST used for speech recognition typically represents the various possible path choices, and their corresponding probabilities, for outputting recognition results after a speech signal is input during speech processing.
Composition of WFSTs combines two WFSTs of different levels. For example, the dictionary WFST is a mapping from phonemes to characters or words, and the language WFST is a mapping from characters or words to word sequences; after the two WFSTs are composed, the result is a mapping from phonemes to word sequences. FIGS. 1a, 1b, and 1c show one implementation of WFST composition: FIGS. 1a and 1b are two WFSTs of different levels, and FIG. 1c is the new WFST generated after composition.
As shown in the figures, the first step in the model of FIG. 1a has two paths: the first path is 0->1, with a probability of 0.2 for A1 to B1 (written A1:B1/0.2), and the second path is 0->2, A2:B2/0.3. The first step in the model of FIG. 1b has only one path, 0->1, B1:C2/0.4. Therefore, after FIGS. 1a and 1b are composed there is only one path, A1->B1->C2: the FIG. 1a path 0->1, A1:B1/0.2 combines with the corresponding FIG. 1b path 0->1, B1:C2/0.4 to give the path (0,0)->(1,1), A1:C2/0.6, as shown in FIG. 1c. Here the probability values are added; in actual computation the combination of probabilities includes but is not limited to summation, product, and other linear and non-linear transformations. The combination of the other paths during composition is similar. Reaching (1,1) means reaching position 1 in the network of FIG. 1a, where there are again two paths to take, 1->1, A3:B2/0.4 and 1->3, A1:B2/0.5; the network of FIG. 1b is also at state 1 at this time, with only one path, 1->2, B2:C4/0.5. In this state A3:B2 can combine with B2:C4, and A1:B2 can also combine with B2:C4, reaching two new states: (1,1)->(1,2) A3:C4/0.9 and (1,1)->(3,2) A1:C4/1.0. Similarly, 1->3, A1:B2/0.5 in FIG. 1a can combine with 2->2, B2:C4/0.6 in FIG. 1b to give the path (1,2)->(3,2) A1:C4/1.1. In this way two WFST networks can be combined and composed into a new WFST, FIG. 1c.
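The walk-through above can be reproduced with a short script (a hypothetical sketch: each WFST is a list of arcs (src, dst, in_label, out_label, weight), matching arcs are paired on A's output label and B's input label, and weights are added as in the example):
```python
def compose(fst_a, fst_b):
    """Compose two WFSTs: match A's output labels against B's input labels."""
    result = []
    for (a_src, a_dst, a_in, a_out, a_w) in fst_a:
        for (b_src, b_dst, b_in, b_out, b_w) in fst_b:
            if a_out == b_in:  # e.g. A1:B1 in FIG. 1a matches B1:C2 in FIG. 1b
                result.append(((a_src, b_src), (a_dst, b_dst),
                               a_in, b_out, a_w + b_w))
    return result

fst_1a = [(0, 1, "A1", "B1", 0.2), (0, 2, "A2", "B2", 0.3),
          (1, 1, "A3", "B2", 0.4), (1, 3, "A1", "B2", 0.5)]
fst_1b = [(0, 1, "B1", "C2", 0.4), (1, 2, "B2", "C4", 0.5),
          (2, 2, "B2", "C4", 0.6)]
for arc in compose(fst_1a, fst_1b):
    print(arc)  # includes ((0, 0), (1, 1), 'A1', 'C2', 0.6) as in FIG. 1c
```
The sketch enumerates all label-matched arc pairs; a production implementation would additionally trim state pairs unreachable from the joint start state.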
FIG. 2 shows a diagram of a speech recognition decoding system according to an embodiment of the present invention. The speech recognition method and device of the embodiments of the present invention are applied to the electronic terminal and one or more servers (101, 102) in FIG. 2. The electronic terminal may include, but is not limited to, a smartphone, a personal computer, a tablet computer, a smart watch, smart glasses, a smart audio device, a vehicle-mounted electronic terminal, a service robot, and the like. The electronic terminal and the servers 101 and 102 may be communicatively connected through one or more networks, which may be wired or wireless networks, such as the Internet, a cellular network, a satellite network, a local area network, and/or the like.
The server 102 is configured to construct the speech recognition WFST and output the constructed speech recognition WFST to the server 101 for the construction of the speech recognition decoder and for speech recognition decoding.
The specific construction includes: constructing an acoustic weighted finite state transducer WFST, the acoustic WFST being a search network from acoustic features to pronunciation attributes; constructing a pronunciation WFST, the pronunciation WFST being a search network from pronunciation attributes to phonemes; constructing a dictionary WFST, the dictionary WFST being a search network from phonemes to characters or words; constructing a language WFST, the language WFST being a search network from characters or words to word sequences; and constructing a second acoustic WFST, the second acoustic WFST being a search network from acoustic features to phonemes, where constructing the second acoustic WFST is an optional step.
The construction of the above speech recognition WFST may optionally include a context WFST, where the context WFST is a search network from context-dependent phonemes to phonemes; when the construction of the speech recognition WFST includes the context WFST, the pronunciation WFST is a search network from pronunciation attributes to context-dependent phonemes.
A plurality of WFSTs are integrated to generate the speech recognition WFST, the speech recognition WFST being a search network from acoustic features to word sequences, where the plurality of WFSTs include: the acoustic WFST, the pronunciation WFST, the dictionary WFST, the language WFST, the second acoustic WFST (optional), and the context WFST (optional).
The electronic terminal picks up speech sound waves through a voice acquisition device, such as a microphone, generates a voice signal, and sends the voice signal to the server 101. The server 101 is configured to receive the voice signal, extract an acoustic feature sequence from the voice signal, input the extracted acoustic feature sequence into the speech recognition WFST for searching, obtain the probability of each path from acoustic features to word sequences, compare the probabilities of the paths, and send the path with the largest probability back to the terminal as the recognition result.
The paths obtained by the WFST search performed during speech recognition here may refer to active paths. Each path in the WFST has a probability value; to reduce the amount of computation during decoding, paths with smaller probabilities are pruned during decoding and are no longer expanded, while paths with larger probabilities continue to be expanded. These paths are the active paths.
In addition to the above implementation, the execution subject of the speech recognition WFST construction in the specific embodiments of the present invention may also be an electronic terminal device; that is, the electronic terminal device executes the above speech recognition WFST construction method of the embodiments of the present invention to construct the speech recognition WFST and performs speech decoding based on the speech recognition WFST. The execution subject of the speech recognition WFST construction may also be the server 101; that is, the functions of the server 101 and the server 102 are merged, and the server 101 executes the above speech recognition WFST construction method of the embodiments of the present invention to construct the speech recognition WFST and performs speech decoding on the voice signal sent by the terminal based on the speech recognition WFST.
The detailed construction process of the above speech recognition WFST and the specific methods of speech recognition decoding are described in detail in the subsequent specific embodiments.
Besides the above implementation of constructing a speech recognition WFST for decoding, the speech recognition decoding method and device of the embodiments of the present invention applied to the electronic terminal and servers in FIG. 2 may also be as follows: the electronic terminal receives a voice signal and sends the voice signal to the server 101; the server 101 may be configured to receive the voice signal and, based on the acoustic WFST, the pronunciation WFST, the dictionary WFST, the language WFST, the second acoustic WFST (optional), and the context WFST (optional), execute the speech recognition methods in FIGS. 8-11. Alternatively, the electronic terminal may receive the voice signal and, based on the acoustic WFST, the pronunciation WFST, the dictionary WFST, the language WFST, the second acoustic WFST (optional), and the context WFST (optional), execute the speech recognition methods in FIGS. 8-11. The speech recognition methods in FIGS. 8-11 are explained in detail in the description of the subsequent specific embodiments.
That is, the speech recognition decoding solution of the embodiments of the present invention may also be a dynamic decoding solution: there is no need to integrate the WFSTs into a single speech recognition WFST, and the terminal or the server decodes directly based on the individual WFSTs.
本发明实施例的一种语音识别解码WFST构建流程如图5所示,主要包括:
步骤501：生成声学WFST。声学WFST是从声学特征到发音属性的搜索网络，例如可以是隐马尔可夫模型（HMM）WFST（以H1表示）。
HMM(Hidden Markov Model)是关于时序的概率模型,描述由一个隐藏的马尔科夫链生成不可观测的状态随机序列,再由各个状态生成观测随机序列的过程。HMM的参数中包括所有可能的状态的集合,以及所有可能的观测的集合。HMM由初始概率分布、状态转移概率分布以及观测概率分布确定。初始概率分布和状态转移概率分布决定状态序列,观测概率分布决定观测序列。给定模型参数与观测序列,通过前后向算法计算给定模型下观测到上述观测序列的概率;给定观测序列,通过期望最大化算法估计模型参数,使得在该模型下观测序列概率最大;给定模型和观测序列,通过Viterbi 估计最优状态序列。
本发明实施例中声学WFST的构建可以是以发音属性作为状态,以声学特征作为观测,其中声学特征可以表现为各种组合的声学特征序列,采用HMM模型描述由发音属性生成声学特征的过程,通过前后向算法计算HMM模型下发音属性作为状态观测到声学特征的观测概率;给定声学特征,通过期望最大化算法和观测概率估计HMM模型参数,使得在该参数下发音属性作为状态所观测到声学特征概率最大;利用模型参数,通过Viterbi估计一个发音属性,及在该发音属性条件下产生给定观测(声学特征)的概率。
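作为上述"给定模型和观测序列、通过Viterbi估计最优状态序列"这一步骤的直观说明，下面给出一个最小化的Python草图。其中状态对应发音属性、观测对应离散化后的声学特征记号，各概率均为虚构的演示数值，并非真实模型参数：

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """标准Viterbi算法：返回使观测序列概率最大的状态序列及其对数概率。"""
    # V[t][s]：t时刻以状态s结尾的最优路径的对数概率；back记录回溯指针
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, lp = max(((p, V[t - 1][p] + math.log(trans_p[p][s]))
                            for p in states), key=lambda x: x[1])
            V[t][s] = lp + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):  # 回溯最优状态序列
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

# 虚构的两个发音属性状态与离散观测（仅作演示）
states = ["鼻音", "非鼻音"]
start_p = {"鼻音": 0.4, "非鼻音": 0.6}
trans_p = {"鼻音": {"鼻音": 0.7, "非鼻音": 0.3},
           "非鼻音": {"鼻音": 0.2, "非鼻音": 0.8}}
emit_p = {"鼻音": {"o1": 0.6, "o2": 0.4}, "非鼻音": {"o1": 0.3, "o2": 0.7}}
print(viterbi(["o1", "o2", "o2"], states, start_p, trans_p, emit_p))
```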
步骤502:生成发音WFST,发音WFST是从发音属性(Articulatory Features)到音素或上下文相关音素的搜索网络(以A表示)。
发音属性信息可以是现有已知的发音方式的分类（如表1所示），也可以是根据发音位置的分类，但不限于此，还可以通过神经网络学习获得新的发音属性类别。神经网络包括多种算法模型，其中包括深度神经网络算法。其中发音WFST可以是发音属性到音素的搜索网络，也可以是发音属性到上下文相关音素的搜索网络。当构建发音属性到音素的搜索网络时，可以是深度神经网络以声学特征为输入，以发音属性和音素为双目标输出，后续步骤也是基于发音属性和音素的训练；当构建发音属性到上下文相关音素的搜索网络时，可以是深度神经网络以声学特征为输入，以发音属性和上下文相关音素为双目标输出，后续步骤也是基于发音属性和上下文相关音素的训练。两种发音WFST的构建过程的训练方式相同，只是训练和构建的目标不同，因此下面将两种训练过程合并描述：下文中"音素/上下文相关音素"表示"或"的关系，在生成发音属性到音素的搜索网络时选择音素相关的训练方案，在生成发音属性到上下文相关音素的搜索网络时选择上下文相关音素相关的训练方案。
表1:英语音素与发音属性对应表示例
发音属性至音素/上下文相关音素的对应可以通过以下步骤获取:
a)深度神经网络多目标训练，获取发音属性分类器，以及语音帧属于各发音属性的概率。深度神经网络以声学特征为输入，以发音属性和音素/上下文相关音素为双目标输出，利用前向传播算法计算给定声学特征条件下的音素/上下文相关音素概率及发音属性概率，并利用梯度下降算法训练深度神经网络参数。上述发音属性概率是指声学特征属于各发音属性的概率，并将概率最大的发音属性定义为当前声学特征的发音属性。
b)利用发音属性分类器,将声学特征序列重新对齐到发音级,获取到新标注。如帧数为T的声学特征序列O1,O2…OT,每一帧声学特征通过上述分类器都可以获得概率最大的发音属性,得到长度为T的发音属性序列A1,A2…AT。此发音属性序列即为新标注。
c)利用新标注A1,A2…AT重新对发音属性分类器进行多目标训练,更新上述发音属性分类器。
d)利用更新后的分类器,重新执行上述对齐获得新标注A1’,A2’…AT’。
e)任意一语音帧会被分类到某一发音属性及某一音素/上下文相关音素。例如对声学特征O1，通过上述深度神经网络，得到概率最大的音素及发音属性，记概率最大的音素/上下文相关音素为P/P1-P+P2，概率最大的发音属性为A，那么发音属性A与音素P/上下文相关音素P1-P+P2有一次共现。在大量语音库上统计A与P/P1-P+P2的共现次数，并除以总帧数，得到A与P/P1-P+P2的共现概率。任意音素/上下文相关音素与任意发音属性的共现概率通过同样的方法得到（统计方式可参考本步骤列表后的示意代码）。
f)将A、P/P1-P+P2及其共现概率表示成发音WFST。发音WFST状态转换的输入为发音属性A,输出为音素P或上下文相关音素P1-P+P2及A与P/P1-P+P2的共现概率。
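下面给出步骤e)、f)中"统计共现次数并除以总帧数"及生成发音WFST弧的一个Python示意。逐帧的分类结果在此用虚构数据代替，真实系统中应来自上述深度神经网络的逐帧输出：

```python
from collections import Counter

def cooccurrence_arcs(frame_attrs, frame_phones):
    """统计逐帧(发音属性, 音素)的共现概率，生成发音WFST的弧：
    输入为发音属性，输出为音素，权重为共现概率（共现次数/总帧数）。"""
    assert len(frame_attrs) == len(frame_phones)
    total = len(frame_attrs)
    counts = Counter(zip(frame_attrs, frame_phones))
    return [(attr, phone, c / total) for (attr, phone), c in counts.items()]

# 虚构的逐帧分类结果（仅作演示）
attrs = ["鼻音", "鼻音", "非鼻音", "非鼻音", "鼻音"]
phones = ["n", "n", "a", "a", "m"]
for arc in cooccurrence_arcs(attrs, phones):
    print(arc)  # 例如('鼻音', 'n', 0.4)
```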
步骤503：生成上下文WFST，上下文WFST是上下文相关音素到音素的映射。上下文相关音素比较常用的模型有三音子（记为音子/左音子/右音子），也可以采用四音子。
以三音子为例，在声学建模时，为考虑上下文语境对当前发音的影响，通常可以采用上下文相关的音素模型作为基本建模单元，并且采用决策树聚类等方式减少模型状态数量，以避免上下文相关音素模型训练时数据不足的问题。上下文WFST构建从上下文相关音素到音素的映射：上下文WFST从某一个状态出发，接收一个上下文相关音素，输出一个音素及概率，到达目的状态，完成一次转移。上下文WFST可以通过现有技术中的其他方式构建生成。
步骤503为可选步骤：当包含步骤503时，步骤502生成发音属性到上下文相关音素的搜索网络；当不包含步骤503时，步骤502生成发音属性到音素的搜索网络。
步骤504,生成词典WFST,词典WFST是从音素到字或词的搜索网络(以L表示)。
词典通常以词-音素序列的形式表示。如果一个词有不同发音，则会以多个词-音素序列的形式表示。在生成词典WFST时，可以对音素及字（或词）进行编号，并引入消歧符号解决同音字等问题。消歧符号是在词典中的音素序列末尾插入的符号#1、#2、#3等。当某一音素序列是词典中另一个音素序列的前缀，或者出现在一个以上的单词中时，需要在其后加入这些符号之一，以确保WFST是可确定化的。上述过程生成的词典以WFST的形式表示词-音素的映射关系：WFST接收音素序列，输出词。词典WFST也可以通过现有技术中的其他方式构建生成。
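作为上述消歧符号规则的一个简化示意，下面的Python草图在某一音素序列是其它序列的前缀、或对应多个词（同音）时，于其末尾追加#1、#2等符号。词典内容为虚构示例，真实词典WFST的构建还包括编号、建弧等步骤，此处不展开：

```python
def add_disambig(lexicon):
    """lexicon: [(词, 音素列表)]。若某音素序列是其它序列的前缀，
    或同一序列对应多个词，则在其末尾追加消歧符号#k。"""
    seqs = [tuple(phones) for _, phones in lexicon]
    next_id = {}
    result = []
    for word, phones in lexicon:
        key = tuple(phones)
        ambiguous = (seqs.count(key) > 1  # 同音：同一序列对应多个词
                     or any(s != key and s[:len(key)] == key for s in seqs))  # 是前缀
        if ambiguous:
            k = next_id.get(key, 0) + 1
            next_id[key] = k
            phones = list(phones) + [f"#{k}"]
        result.append((word, phones))
    return result

# 虚构小词典："图"是"图书"的前缀；"他"与"她"同音
lexicon = [("图", ["t", "u"]), ("图书", ["t", "u", "sh", "u"]),
           ("他", ["t", "a"]), ("她", ["t", "a"])]
for entry in add_disambig(lexicon):
    print(entry)  # "图"、"他"、"她"的音素序列末尾被追加了消歧符号
```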
步骤505,生成语言WFST,语言WFST是一个字或词到词序列的搜索网络(以G表示)。
语言模型用于描述字或词到词序列等不同的语法单元的概率分布的模型,用于计算一个词序列出现的概率,或者预测给定历史词语序列下,一个词语出现的概率。
N-gram语言模型是最常用的表示形式之一。它利用马尔可夫假设，认为一个词语出现的概率仅与其前面出现的N-1个词语有关。比如，1-gram语言模型表示词语出现仅与自身有关，2-gram表示词语出现仅与前一个词有关，3-gram表示词语出现仅与前两个词有关，等等。
在构造语言模型时采用最大似然估计来进行概率估计,通过计算N-gram词序列在语料中出现的次数来计算相应的概率。将上述词序列及其概率表示成状态转换。语言WFST也可以通过现有技术中的其他方式构建生成。
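以2-gram为例，下面给出用最大似然估计计算N-gram概率的一个最小示意（语料为虚构的两句话；真实系统中还需平滑等处理，此处从略）：

```python
from collections import Counter

corpus = [["我", "想", "听", "歌"], ["我", "想", "看", "书"]]  # 虚构语料

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((sent[i], sent[i + 1]) for sent in corpus
                 for i in range(len(sent) - 1))

def bigram_prob(w1, w2):
    """最大似然估计：P(w2|w1) = count(w1, w2) / count(w1)。"""
    return bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

print(bigram_prob("我", "想"))  # 2/2 = 1.0
print(bigram_prob("想", "听"))  # 1/2 = 0.5
```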
前述步骤501、步骤502、步骤503、步骤504、步骤505不分先后;其中步骤503为可选步骤。
当执行步骤503时，发音WFST为发音属性到上下文相关音素的搜索网络，步骤510为：整合声学模型WFST、发音WFST、上下文WFST、词典WFST、语言WFST，确定化、最小化整合后生成语音识别WFST。整合过程为将词典WFST与语言模型WFST进行整合运算，得到的WFST再与上下文WFST整合运算，得到的结果再与发音WFST做整合运算，最后进一步与声学WFST做整合。完成整合运算后，即得到一个从声学特征（状态概率分布）对应到词序列的WFST加权有限状态转换器。
当不执行步骤503时,步骤510为:整合声学模型WFST、发音WFST、词典WFST、语言WFST,确定化、最小化整合后的WFST生成解码器。整合过程为,将词典WFST与语言模型WFST进行整合运算,得到的WFST再与发音WFST做整合运算,进一步与声学WFST做整合。完成整合运算后,即得到一个从声学特征(状态概率分布)对应到词序列的WFST加权有限状态转换器。
图6对应于本发明实施例的又一种语音识别解码WFST构建方法，与图5所示的实施例的语音识别解码WFST构建方法的区别在于增加了以下步骤：
步骤606：生成第二声学WFST网络。第二声学模型WFST是从声学特征到音素或者上下文相关音素的搜索网络（以H2表示），例如隐马尔可夫模型（HMM）WFST。第二声学WFST的生成方法在现有技术中已经存在，因此此处不再做详细的介绍。
同样,在图6所示的语音识别解码WFST构建方法中步骤603是可选步骤。因此对应的,上述第二声学WFST可以是声学特征到音素的搜索网络,也可以是声学特征到上下文相关音素的搜索网络。当构建的过程中不包括步骤603时,步骤606生成的第二声学WFST为声学特征到音素的搜索网络;当构建的过程中包括步骤603时,步骤606生成的第二声学WFST为声学特征到上下文相关音素的搜索网络。
步骤601、602、603、604、605与图5所示的实施例中的步骤501、502、503、504、505一样,不再赘述。步骤601、602、603、604、605、606不分先后。
步骤610：整合发音WFST、声学模型WFST、第二声学模型WFST、上下文WFST（可选）、词典WFST、语言WFST，确定化、最小化整合后的WFST生成解码器。整合过程包括上下文WFST时，整合结果表示为(H1*A+H2)*C*L*G：将声学WFST和发音WFST的整合结果和第二声学WFST进行网络合并，生成一个声学特征到上下文相关音素的WFST；将词典WFST与语言模型WFST进行整合，得到的有限状态转换器再与上下文WFST整合；整合后的结果再与网络合并后的WFST做整合，整合后的语音识别解码器WFST以(H1*A+H2)*C*L*G表示。整合过程不包括上下文WFST时，整合结果表示为(H1*A+H2)*L*G：将声学WFST和发音WFST的整合结果和第二声学WFST进行网络合并，生成一个声学特征到音素的WFST；将词典WFST与语言模型WFST进行整合，得到的有限状态转换器再与网络合并后的WFST做整合，整合后的语音识别解码器WFST以(H1*A+H2)*L*G表示，其中的每条成功路径都表示一种可能的声学特征到词序列的对应。上述各个WFST进行整合（Composition）最后形成了声学特征到词序列的映射。
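对其中的"网络合并"操作，下面给出一个示意性的Python草图（弧的元组表示与merge_wfsts函数均为本文假设；相同弧的概率组合此处按求和处理，实际也可为乘积或其它变换）：

```python
from collections import defaultdict

def merge_wfsts(arcs_a, arcs_b):
    """合并两个输入/输出类型相同的WFST：相同的弧合并并组合概率（此处求和），
    不同的弧直接保留，得到输入输出类型不变的新WFST。"""
    merged = defaultdict(float)
    for src, dst, i, o, w in arcs_a + arcs_b:
        merged[(src, dst, i, o)] += w
    return [(src, dst, i, o, w) for (src, dst, i, o), w in merged.items()]

# 虚构示例：H1*A与H2中各有若干"声学特征->音素"的弧
h1_a = [(0, 1, "feat1", "p1", 0.3), (0, 2, "feat1", "p2", 0.2)]
h2 = [(0, 1, "feat1", "p1", 0.4), (0, 3, "feat2", "p3", 0.5)]
print(merge_wfsts(h1_a, h2))  # 相同的弧(0,1,feat1,p1)被合并，权重为0.7
```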
图7示出了本发明实施例的一种语音识别解码方法。
步骤701：从语音信号帧中提取声学特征信息。声学特征提取的方式有多种，本发明的实施例中并不对其进行特别限定，例如：将信号拾取单元输出的语音信号划分成多个语音信号帧，通过消除噪音、信道失真等处理对各语音信号帧进行增强，再将各语音信号帧从时域转化到频域，并从转换后的语音信号帧内提取合适的声学特征。
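下面用numpy给出"分帧—加窗—时域转频域"这一常见流程的示意草图（帧长25ms、帧移10ms等参数均为示例取值；真实系统中具体特征类型如MFCC、FBank等由设计决定，此处仅演示到幅度谱为止）：

```python
import numpy as np

def frame_and_fft(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """将时域语音信号分帧、加汉明窗并做FFT，返回每帧的幅度谱。"""
    frame_len = int(sample_rate * frame_ms / 1000)  # 每帧采样点数
    shift = int(sample_rate * shift_ms / 1000)      # 帧移采样点数
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))   # 时域 -> 频域
    return np.array(frames)

sig = np.random.randn(16000)      # 1秒的虚构信号，仅作演示输入
print(frame_and_fft(sig).shape)   # (帧数, 频点数)
```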
步骤702:以声学特征为输入,搜索语音识别WFST网络的路径,获取声学特征至词序列的各路径的概率。
其中所述语音识别WFST可以是通过图5、图6中所提到方法生成。
步骤703:比较各路径的概率,概率最大的路径所对应的词序列作为识别结果输出。
此处包括后续的解码过程中各步骤进行的WFST搜索所获得的路径可以是指活跃路径。其中活跃路径是指WFST中每条路径都有一个概率值,为了在解码时降低计算量,一些概率较小的路径,在解码过程中将会被裁剪,不再扩展;概率较大的路径会继续扩展,这些路径就是活跃路径。
上述实施例的语音识别解码方法通过采用考虑到发音属性与音素、发音属性与声学特征相关性而生成的语音识别WFST进行语音识别解码，能够增强对于外界噪声和混响的干扰抵抗能力，提高语音识别系统对环境的鲁棒性。
除了上述基于构建后的语音识别WFST的解码方法外,图8示出了本发明实施例的另一种语音识别解码方法:
步骤801:从语音信号中提取声学特征序列。
声学特征提取方式可以为:例如将信号拾取单元输出的语音信号划分成多个语音信号帧,通过消除噪音、信道失真等处理对各语音信号帧进行增强,再将各语音信号帧从时域转化到频域,并从转换后的语音信号帧内提取合适的声学特征。
步骤802:将语音帧对应的声学特征输入声学WFST,获取声学特征至发音属性的各路径的概率。
步骤803:以声学特征至发音属性的各路径输出的发音属性作为发音WFST网络的输入,获取发音属性至音素的各路径的概率。
步骤804:以发音属性至音素的各路径输出的音素作为词典WFST网络的输入,获取音素至字(或词)的各路径的概率。
步骤805:以音素至字(或词)的各路径输出的字(或词)作为语言WFST网络的输入,获取字(或词)至词序列的各路径的概率。
步骤810:根据各WFST中各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
所述声学特征序列通常是指接收的语音信号帧中从起始帧到最后一帧的语音帧序列所一一对应的声学特征序列。WFST间总路径概率的计算可以有多种方式,包括但不限于求和、乘积及其它线性及非线性变换。
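以"各级路径概率相乘（等价于对数域相加）"为例，下面给出总概率组合的一个微型示意（各级概率为虚构数值）：

```python
import math

# 假设某条完整路径在声学/发音/词典/语言各级WFST上的概率（虚构数值）
stage_probs = [0.8, 0.6, 0.9, 0.7]

total_product = math.prod(stage_probs)             # 概率连乘
total_log = sum(math.log(p) for p in stage_probs)  # 对数域求和，数值上更稳定
print(total_product, math.exp(total_log))          # 两者应相等
```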
在一个实施例中,上述声学WFST、发音WFST、词典WFST、语言WFST可以是基于图5中相应构建步骤的方式所生成的。
图8中的上述解码步骤可以采用时间同步的维特比光束（Time-synchronous Viterbi Beam）搜索算法，举例如下：Viterbi-Beam搜索算法是一个宽度优先的帧同步算法，其核心是一个嵌套循环：每当往后推移一帧，就针对相应层次的每个节点分别运行Viterbi算法。
下面给出Viterbi Beam搜索算法的基本步骤（步骤列表之后附有一个示意性的代码草图）：
1.初始化搜索路径,在当前路径集合A中添加起始路径,设该路径为搜索网络的起始节点,并且设此刻时间t=0;
2.在t时刻,对于声学模型WFST的路径集合A中的每一条路径,都向后扩展一帧至所有可以达到的状态,执行Viterbi算法。比较扩展路径前驱的得分,并保留最佳得分。再利用发音WFST、词典WFST和语言WFST对路径重新判断得分;
3.利用设置的门限（光束宽度）裁剪掉得分过低或低于门限的路径，保留得分高于门限的路径，并将这些路径添加到A中，得到t+1时刻WFST的路径集合；
4.重复步骤2-3,直到所有语音帧计算完毕。回溯集合A中得分最高的路径。
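下面是上述帧同步光束裁剪思想的一个简化Python草图，仅演示"逐帧扩展—打分—裁剪活跃路径"的骨架：其中expand为本文假设的示例扩展函数；此处以"保留得分最高的beam_width条"近似门限裁剪，实际系统可按得分门限裁剪，且各级WFST的重打分远比此复杂：

```python
def beam_search(frames, init_paths, expand, beam_width):
    """帧同步光束搜索骨架：逐帧扩展活跃路径，仅保留得分最高的beam_width条。"""
    active = init_paths  # 活跃路径集合A：[(路径, 累计得分)]
    for frame in frames:
        expanded = []
        for path, score in active:
            # 向后扩展一帧，得到若干(新路径, 本帧得分增量)
            for new_path, delta in expand(path, frame):
                expanded.append((new_path, score + delta))
        # 裁剪：得分较低的路径不再扩展，剩余即为下一时刻的活跃路径
        expanded.sort(key=lambda x: x[1], reverse=True)
        active = expanded[:beam_width]
    return max(active, key=lambda x: x[1])  # 返回得分最高的路径

# 虚构的扩展函数：每帧将每条路径扩展出两个候选（仅作演示）
def toy_expand(path, frame):
    return [(path + [frame + "_a"], -1.0), (path + [frame + "_b"], -2.0)]

print(beam_search(["f1", "f2", "f3"], [([], 0.0)], toy_expand, beam_width=2))
```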
上述实施例的语音识别解码方法和装置通过采用考虑到发音属性与声学特征、发音属性与音素相关性而构建的声学WFST和发音WFST进行语音识别解码，能够增强对于外界噪声和混响的干扰抵抗能力，提高语音识别系统对环境的鲁棒性。
图9为本发明实施例的又一种语音识别解码方法流程:
步骤901:从语音信号帧中提取声学特征信息。
通常接收到的待识别的语音信号可以切割为多个语音信号帧，解码识别的过程首先是对语音信号进行声学特征提取。
步骤902:将语音帧对应的声学特征输入声学WFST,获取声学特征至发音属性的各路径的概率得分。
步骤903：以声学特征至发音属性的各路径输出的发音属性作为发音WFST网络的输入，获取发音属性至上下文相关音素的各路径的概率。
步骤904:以发音属性至上下文相关音素的各路径输出的上下文相关音素作为上下文WFST网络的输入,获取上下文相关音素至音素的各路径的概率。
步骤905:以上下文相关音素至音素的各路径输出的音素作为词典WFST网络的输入,获取音素至字(或词)的各路径的概率。
步骤906:以音素至字(或词)的各路径输出的字(或词)作为解码器中的语言WFST网络的输入,获取字(或词)至词序列的各路径的概率。
步骤910:根据各WFST中各路径的概率来获得起始帧到最后一帧的声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
WFST间总路径概率的计算可以有多种方式,包括但不限于求和、乘积及其它线性及非线性变换。
在一个实施例中,上述声学WFST、发音WFST、上下文WFST、词典WFST、语言WFST可以是基于图5中相应构建步骤的方式所生成的。
通过上述实施例的方法和装置,不仅采用了考虑到发音属性与声学特征,发音属性与音素相关性而构建的声学WFST和发音WFST进行语音识别解码,还能够将语音的发音学知识加入到语音识别解码过程中,在远场等强噪音、强混响环境下,利用发音不受噪音干扰的性质,弥补传统声学模型受到噪声和混响等外界干扰后概率不准确的问题,并且通过上下文WFST的引入,能够提升语音识别过程中音素识别的准确性。
图10给出了本发明实施例的又一种语音解码识别方法流程图,与图8的解码流程图比较,区别在以下步骤:
步骤1002、以声学特征作为声学WFST网络的输入，获取声学特征至发音属性的各路径的概率；同时以声学特征作为第二声学WFST网络的输入，获取声学特征至音素的各路径的概率；
步骤1004、以发音属性至音素的各路径输出的音素以及第二声学WFST网络输出的声学特征至音素的各路径中的音素作为词典WFST网络的输入,获取音素至字(或词)的各路径的概率;
其他步骤1001、1003、1005、1010和图8中的801、803、805、810相同，因此不再赘述。
在一个实施例中,上述声学WFST、发音WFST、词典WFST、语言WFST可以是基于图6中相应构建步骤的方式所生成的。
上述方案改进了传统声学建模的方法,加入了不受噪声、混响等外界干扰的发音属性特征,并在传统的解码搜索基础上,提出改进的解码搜索方法,利用语音帧属于发音属性的概率,以及发音属性与音素的相关性,提高语音识别系统对环境的鲁棒性。图11给出了本发明实施例的又一种的语音识别解码方法流程图,与图9的解码流程图比较,区别在以下步骤:
步骤1102、将声学特征作为声学WFST网络的输入,获取声学特征至发音属性的各路径的概率;以声学特征为输入,搜索第二声学WFST网络的路径,获取声学特征至上下文相关音素的各路径的概率;
步骤1103、以声学特征至发音属性的各路径输出的发音属性作为发音WFST网络的输入,获取发音属性至上下文相关音素的各路径的概率;
步骤1104、以发音属性至上下文相关音素的各路径输出的上下文相关音素以及第二声学WFST输出的声学特征至上下文相关音素的各路径中的上下文相关音素作为解码器中的上下文WFST网络的输入,获取上下文相关音素至音素的各路径的概率;
在一个实施例中,上述声学WFST、发音WFST、上下文WFST、词典WFST、语言WFST可以是基于图6中相应构建步骤的方式所生成的。
通过上述实施例的方法和装置,改进了传统声学建模的方法,加入了不受噪声、混响等外界干扰的发音属性特征,并在传统的解码搜索基础上,提出改进的解码搜索方法,利用语音帧属于发音属性的概率,以及发音属性与音素的相关性,提高语音识别系统对环境的鲁棒性,此外还引入了上下文音素模型,结合上下文模型进行音素识别,提升语音识别的准确度。
图3给出了本发明实施例的一种语音识别解码系统结构图;其中包括语音识别WFST构建装置100和语音识别解码装置200,基于上文对图2所示语音识别解码系统的介绍,本发明的语音识别WFST构建装置可以设置于服务器102或服务器101或电子终端,语音识别解码装置可以设置于电子终端设备或服务器101。
图3中的语音识别WFST构建装置100包括:301声学WFST生成单元、302发音WFST生成单元、303上下文WFST生成单元、304词典WFST生成单元、305语言WFST生成单元、306解码器生成单元。
其中,301声学WFST生成单元用于生成声学模型WFST,声学模型WFST是从声学特征到发音属性的搜索网络(以H1表示)。
302发音WFST生成单元用于构建生成发音WFST,发音WFST是从发音属性(Articulatory Features)到音素或上下文相关音素的搜索网络(以A表示)。
303上下文WFST生成单元用于生成上下文(context)WFST(以C表示),303上下文WFST可以是上下文相关音素到音素的映射,其中上下文相关音素可以是三音子(记为音子/左音子/右音子)。303上下文WFST生成单元为语音识别解码器的构建装置中的可选单元。当语音识别WFST的构建装置中包括303上下文WFST生成单元,302发音WFST生成单元所构建的发音WFST是从发音属性到上下文相关音素的搜索网络;当语音识别WFST的构建装置中不包括303上下文WFST生成单元,302发音WFST生成单元所构建的发音WFST是从发音属性到音素的搜索网络。
304词典WFST生成单元用于生成词典(Lexicon)WFST,词典WFST是从音素到字(或词)的搜索网络(以L表示)。
305语言WFST生成单元用于生成语言模型(Language Model)WFST,语言模型WFST是一个字(或词)到词序列的搜索网络(以G表示)。
306语音识别WFST生成单元,用于对声学模型WFST、发音WFST、词典WFST、语言模型WFST进行整合、确定化、最小化等获得最后的语音识别解码器WFST。
如果不引入上下文WFST(C),306语音识别WFST生成单元进行整合的过程,是将304词典WFST与305语言模型WFST进行整合运算,得到的WFST再与302发音WFST做整合运算,进一步与301声学WFST做整合。完成整合运算后,即得到一个从声学特征(状态概率分布)对应到词序列的WFST加权有限状态转换器。整合后的加权有限状态转换器表示为H1*A*L*G,最终得到的语音识别解码器WFST所生成的状态转换网络的每条成功路径都表示一种可能的声学特征到词序列的对应。如果引入上下文WFST(C),整合后的语音识别解码器WFST为H1*A*C*L*G。整合步骤为将304词典WFST与305语言模型WFST进行整合运算,得到的WFST再与303上下文WFST整合运算,得到的结果再与302发音WFST做整合运算,最后进一步与301声学WFST做整合。完成整合运算后,即得到一个从声学特征(状态概率分布)对应到词序列的WFST加权有限状态转换器。
其中各个WFST的具体生成方法和整合方法可以参考图5的描述,此处不再赘述。
语音识别解码装置200包括:307信号拾取单元(如麦克风)和310解码器。
307信号拾取单元(如麦克风)用于采集获得语音声波获得语音信号。
310解码器包括：308信号处理及特征提取单元和309语音识别解码单元。其中308信号处理及特征提取单元用于对信号拾取单元输出的语音信号进行处理提取声学特征，309语音识别解码单元用于基于语音识别WFST对308信号处理及特征提取单元所提取的声学特征进行解码搜索，获得声学特征至词序列的各路径的概率，输出概率最大的路径对应的识别结果（词序列）。所述语音识别WFST是上文中提到的306语音识别WFST生成单元所生成。声学特征提取的方式有多种，本发明的实施例中并不对其进行特别限定。声学特征提取方式有：例如将信号拾取单元输出的语音信号划分成多个语音信号帧，通过消除噪音、信道失真等处理对各语音信号帧进行增强，再将各语音信号帧从时域转化到频域，并从转换后的语音信号帧内提取合适的声学特征。
图4给出了本发明实施例的另一种语音识别解码系统结构图,系统包括语音识别WFST构建装置300和语音识别解码装置400,在图3实施例的基础上,语音识别WFST构建装置300的组成单元中新增410第二声学模型WFST生成单元。
语音识别WFST构建装置300包括410第二声学模型WFST生成单元,401声学WFST生成单元、402发音WFST生成单元、403上下文WFST生成单元、404词典WFST生成单元、405语言WFST生成单元,406解码器生成单元。
其中401声学WFST生成单元、402发音WFST生成单元、403上下文WFST生成单元、404词典WFST生成单元、405语言WFST生成单元与图3中功能相同,不再赘述。
所述410第二声学模型WFST生成单元用于生成第二声学模型WFST,第二声学模型WFST是从声学特征到音素或上下文相关音素的搜索网络(以H2表示)。第二声学WFST可以由隐马尔可夫模型HMM构建。当语音识别WFST的构建装置中包括403上下文WFST生成单元,410第二声学WFST生成单元所构建的第二声学WFST是从声学特征到上下文相关音素的搜索网络;当语音识别WFST的构建装置中不包括403上下文WFST生成单元,410第二声学WFST生成单元所构建的第二声学WFST是从声学特征到音素的搜索网络。
406语音识别WFST生成单元用于对401声学模型WFST、410第二声学模型WFST,402发音WFST、404词典WFST、405语言模型WFST和403上下文WFST进行整合、确定化、最小化等获得最后的语音识别解码器WFST。其中由于403上下文WFST生成单元为可选单元,因此在整合过程中也可以不包括上下文WFST。
如果不引入上下文WFST（C），整合过程包括：将声学WFST和发音WFST的整合结果和第二声学WFST进行网络合并，生成一个声学特征到音素的WFST；将词典WFST与语言模型WFST进行整合，得到的有限状态转换器再与网络合并后的WFST做整合，整合后的语音识别解码器WFST以(H1*A+H2)*L*G表示，其中的每条成功路径都表示一种可能的声学特征到词序列的对应。如果引入上下文WFST（C），整合过程包括：将声学WFST和发音WFST的整合结果和第二声学WFST进行网络合并，生成一个声学特征到上下文相关音素的WFST；将词典WFST与语言模型WFST进行整合，得到的有限状态转换器再与上下文WFST整合；整合后的结果再与网络合并后的WFST做整合，整合后的语音识别解码器WFST以(H1*A+H2)*C*L*G表示。其中网络合并即是将两个输入输出类型相同的WFST网络进行合并，具体可以是将两个输入输出类型相同的WFST网络中的相同路径进行合并并进行概率的组合，将不同路径保留，生成输入输出类型不变的新的WFST网络。
其中各个WFST的具体生成方法和整合方法已经在图5的描述中介绍过,此处不再赘述。
图4的语音识别解码装置400与图3中实施例的语音识别解码装置200比较，区别在于406语音识别WFST生成单元所构建生成并发送给409语音识别解码单元的WFST与图3实施例中的不同；407信号拾取单元（如麦克风）、408信号处理及特征提取单元的功能和图3实施例中相同。即：409语音识别解码单元所使用的语音识别WFST由406语音识别WFST生成单元整合生成，其中所述整合方式可以是现有的WFST的整合方法，比如通过确定化和最小化的方式整合而得。
图14是本申请实施例提供的一种语音识别解码装置的结构图。如图所示,语音识别解码装置包括:1401语音信号接收单元,1402声学特征提取单元,1403第一获取单元,1404第二获取单元,1405第三获取单元,1406第四获取单元,1410结果输出单元。
1401语音信号接收单元,用于接收语音信号;
1402声学特征提取单元，用于从所述语音信号接收单元1401接收的语音信号中提取声学特征序列；
1403第一获取单元,用于将声学特征提取单元1402提取的声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
1404第二获取单元,用于将所述第一获取单元1403获取的各路径的发音属性输入发音WFST网络,获取发音属性至音素的各路径的概率;
1405第三获取单元，用于将第二获取单元1404获取的各路径的音素作为词典WFST网络的输入，获取音素至字或词的各路径的概率；
1406第四获取单元，用于将第三获取单元1405获取的各路径的字或词输入语言WFST网络，获取字或词至词序列的各路径的概率；
1410结果输出单元,用于根据各个获取单元获取的各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
图15是本申请实施例提供的又一种语音识别解码装置的结构图。如图所示,语音识别解码装置包括:1501语音信号接收单元,1502声学特征提取单元,1503第一获取单元,1504第二获取单元,1505第三获取单元,1506第四获取单元,1507第五获取单元,1510结果输出单元。
1501语音信号接收单元,用于接收语音信号;
1502声学特征提取单元，用于从所述语音信号接收单元1501接收的语音信号中提取声学特征序列；
1503第一获取单元,用于将声学特征提取单元1502提取的声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
1504第二获取单元，用于将所述第一获取单元1503获取的各路径的发音属性输入发音WFST网络，获取发音属性至上下文相关音素的各路径的概率；
1505第三获取单元,用于将所述第二获取单元1504获取的各路径的上下文相关音素输入上下文WFST网络,获取上下文相关音素至音素的各路径的概率;
1506第四获取单元，用于将所述第三获取单元1505获取的各路径的音素输入词典WFST网络，获取音素至字或词的各路径的概率；
1507第五获取单元，用于将所述第四获取单元1506获取的各路径的字或词输入语言WFST网络，获取字或词至词序列的各路径的概率；
1510结果输出单元，用于根据各个获取单元获取的各路径的概率来获得声学特征序列至词序列的各路径的总概率，将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
图16是本申请实施例提供的又一种语音识别解码装置的结构图。如图所示,语音识别解码装置包括:1601语音信号接收单元,1602声学特征提取单元,1603第一获取单元,1604第二获取单元,1605第三获取单元,1606第四获取单元,1607第五获取单元,1610结果输出单元。
1601语音信号接收单元,用于接收语音信号;
1602声学特征提取单元，用于从所述语音信号接收单元1601接收的语音信号中提取声学特征序列；
1603第一获取单元,用于将声学特征提取单元1602提取的声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
1604第二获取单元,将声学特征提取单元1602提取的声学特征序列顺序作为第二声学WFST网络输入,获取声学特征序列至音素的各路径的概率;
1605第三获取单元,将第一获取单元1603获取的各路径输出的发音属性输入发音WFST网络,获取发音属性至音素的各路径的概率;
1606第四获取单元,将第二获取单元1604获取的各路径输出的音素和第三获取单元1605获取的各路径输出的音素输入词典WFST网络,获取音素至字或词的各路径的概率;
1607第五获取单元将第四获取单元1606获取的各路径输出的字或词输入语言WFST网络,获取字或词至词序列的各路径的概率;
1610结果输出单元,用于根据各个获取单元获取的各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
图17是本申请实施例提供的又一种语音识别解码装置的结构图。如图所示,语音识别解码装置包括:1701语音信号接收单元,1702声学特征提取单元,1703第一获取单元,1704第二获取单元,1705第三获取单元,1706第四获取单元,1707第五获取单元,1708第六获取单元,1710结果输出单元。
1701语音信号接收单元,用于接收语音信号;
1702声学特征提取单元，用于从所述语音信号接收单元1701接收的语音信号中提取声学特征序列；
1703第一获取单元,用于将声学特征提取单元1702提取的声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
1704第二获取单元,将声学特征提取单元1702提取的声学特征序列顺序作为第二声学WFST网络输入,获取声学特征序列至上下文相关音素的各路径的概率;
1705第三获取单元,将第一获取单元1703获取的各路径输出的发音属性输入发音WFST网络,获取发音属性至上下文相关音素的各路径的概率;
1706第四获取单元,将第二获取单元1704获取的各路径输出的上下文相关音素和第三获取单元1705获取的各路径输出的上下文相关音素输入上下文WFST网络,获取上下文相关音素至音素的各路径的概率;
1707第五获取单元，将第四获取单元1706获取的各路径输出的音素输入词典WFST网络，获取音素至字或词的各路径的概率；
1708第六获取单元将第五获取单元1707获取的各路径输出的字或词输入语言WFST网络,获取字或词至词序列的各路径的概率;
1710结果输出单元,用于根据各个获取单元获取的各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
图12示出了本发明实施例的一种服务器的结构示意图。图12显示的服务器1208仅仅是一个示例,不应对本发明实施例的功能和使用范围带来任何限制。
如图12所示，服务器1208以通用计算设备的形式表现。服务器1208的组件可以包括：一个或者多个设备处理器1201，存储器1202，以及连接不同系统组件（包括存储器1202和设备处理器1201）的总线1204。
总线1204表示几类总线结构中的一种或多种，包括存储器总线或者存储器控制器、外围总线、图形加速端口、处理器或者使用多种总线结构中的任意总线结构的局域总线。举例来说，可以是工业标准体系结构（ISA）总线、微通道体系结构（MCA）总线、增强型ISA总线、视频电子标准协会（VESA）局域总线以及外围组件互连（PCI）总线。
服务器1208典型地包括多种计算机系统可读介质。这些介质可以是任何能够被服务器1208访问的可用介质,包括易失性和非易失性介质,可移动的和不可移动的介质。存储器1202可以包括易失性存储器形式的计算机系统可读介质,例如随机存取存储器(RAM)1211和/或高速缓存存储器1212。服务器1208可以进一步包括其它可移动/不可移动的、易失性/非易失性计算机系统存储介质。例如,存储系统1213可以用于读写不可移动的、非易失性磁介质(通常称为“硬盘驱动器”)。尽管图12中未示出,可以提供用于对可移动非易失性磁盘(例如“软盘”)读写的磁盘驱动器,以及对可移动非易失性光盘(例如CD-ROM,DVD-ROM或者其它光介质)读写的光盘驱动器。在这些情况下,每个驱动器可以通过一个或者多个数据介质接口与总线1204相连。存储器1202可以包括至少一个程序产品,该程序产品具有一组(例如至少一个)程序模块1214,这些程序模块1214被配置以执行本发明具体实施例中语音识别解码方法的功能。
程序模块1214可以存储在例如存储器1202中,这样的程序模块1214包括但不限于操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。程序模块1214通常执行本发明所描述的实施例中的功能和/或方法。
服务器1208也可以与一个或多个外部设备1206（例如键盘、指向设备、显示器等）通信，还可与一个或者多个使得用户能与该服务器1208交互的设备通信，和/或与使得该服务器1208能与一个或多个其它计算设备进行通信的任何设备（例如网卡、调制解调器等等）通信。这种通信可以通过用户接口1205进行。并且，服务器1208还可以通过通信模块1203与一个或者多个网络（例如局域网（LAN）、广域网（WAN）和/或公共网络，例如因特网）通信。如图所示，通信模块1203通过总线1204与服务器1208的其它模块通信。应当明白，尽管图中未示出，可以结合服务器1208使用其它硬件和/或软件模块，包括但不限于：微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。
设备处理器1201通过运行存储在存储器1202中的程序，从而执行各种功能应用以及数据处理，例如：
处理器1201可用于调用存储于存储器1202中的程序，例如本申请的一个或多个实施例提供的语音识别WFST构建方法（如图5-6所示）在服务器侧的实现程序，或者本申请的一个或多个实施例提供的语音识别解码方法（如图7-11所示）在服务器侧的实现程序，即存储器1202可用于存储上述方法在服务器侧的实现程序，处理器1201执行程序包含的指令。
以图5所述的语音识别WFST的构建方法的实现程序为例,当处理器1201可用于调用存储于存储器1202中的语音识别WFST的构建方法在服务器侧的实现程序时,执行以下步骤:
1、生成声学WFST,声学WFST是从声学特征到发音属性的搜索网络,如隐马尔可夫模型(HMM)WFST。
2、生成发音WFST,发音WFST是从发音属性(Articulatory Features)到音素或上下文相关音素的搜索网络。
3、生成上下文WFST,上下文WFST是上下文相关音素到音素的映射(可选步骤)。
4、生成词典WFST,词典WFST是从音素到字或词的搜索网络。
5、生成语言WFST,语言WFST是一个字或词到词序列的搜索网络。
6、整合声学模型WFST、发音WFST、上下文WFST(可选)、词典WFST、语言WFST后生成语音识别WFST。
上述步骤的具体实现方法在图5的说明中已经详细介绍过,因此不再赘述。
图12的服务器中的处理器还可以用于执行图6的语音识别WFST构建方法在服务器侧的实现程序。图12的服务器中的处理器还可以用于执行图7-11中的一种或多种的语音识别解码方法在服务器侧的实现程序。
上述方法已经在上文中详细介绍过,因此不再赘述。
本发明实施例中的语音识别解码方法可以用于在电子终端中进行语音识别。图13示出了本发明实施例提供的一种电子终端的结构示意图，其中该电子终端1300可以为各种形态的移动终端，包括手机、平板、PDA（Personal Digital Assistant，个人数字助理）、车载终端、可穿戴设备、智能终端等。其中图13显示的电子终端1300仅仅是一个示例，不应对本发明实施例的功能和使用范围带来任何限制。
如图13所示,该电子终端1300包括:RF电路1301、Wi-Fi模块1302、显示单元1303、输入单元1304、第一存储器1305、第二存储器1306、处理器1307、电源1308、GPS模块1309等硬件模块。
其中，RF电路1301用来收发通信信号，能够与其他网络设备通过无线网络进行数据的交互。通信模块1302可以是Wi-Fi模块，用于通过Wi-Fi连接网络进行通信互联，也可以是蓝牙模块，或是其他短距离无线通信模块。
显示单元1303用来显示用户交互界面，用户可以通过显示界面访问移动应用。该显示单元1303可包括显示面板，可选的，可以采用LCD（Liquid Crystal Display，液晶显示器）或OLED（Organic Light-Emitting Diode，有机发光二极管）等形式来配置显示面板。在具体实现中，上述触控面板覆盖该显示面板，形成触摸显示屏，处理器1307根据触摸指令的类型在触摸显示屏上提供相应的视觉输出。具体地，本发明实施例中，该输入单元1304可以包括触控面板，也称为触摸屏，可收集用户在其上或附近的触摸操作（比如用户使用手指、触笔等任何适合的物体或附件在触控面板上操作），可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板。除了触控面板，输入单元1304还可以包括其他输入设备，包括但不限于物理键盘、功能键（比如音量控制按键、开关按键等）、轨迹球、鼠标、操作杆等中的一种或多种。
其中,第一存储器1305存储该装置预设数量的APP以及界面信息;可以理解的,第二存储器1306可以为电子终端1300的外存,第一存储器1305可以为该智能装置的内存。第一存储器1305可以为NVRAM非易失存储器、DRAM动态随机存储器、SRAM静态随机存储器、Flash闪存等其中之一;该智能装置上运行的操作系统通常安装在第一存储器1305上。第二存储器1306可以为硬盘、光盘、USB盘、软盘或磁带机、云服务器等。可选地,现在有一些第三方的APP也可以安装在第二存储器1306上。本发明的具体实施例中的语音识别解码程序或语音识别WFST构建程序都可以存储于第一存储器1305上,也可以存储于第二存储器1306上。
处理器1307是装置的控制中心,利用各种接口和线路连接整个装置的各个部分,通过运行或执行存储在该第一存储器1305内的软件程序和/或模块,以及调用存储在该第二存储器1306内的数据,执行该装置的各种功能和处理数据。可选的,该处理器1307可包括一个或多个处理单元。
电源1308可以为整个装置供电,包括各种型号的锂电池。
GPS模块1309用于获取用户的位置信息,比如位置坐标。
当第一存储器1305或第二存储器1306中安装的某个程序接收到处理器的指令时，执行步骤如下：
在本申请的一些实施例中,第一存储器1305或第二存储器1306可用于存储本申请的一个或多个实施例提供的语音识别解码方法,如图7-11所示的语音识别解码方法在终端侧的实现程序,或者第一存储器1305或第二存储器1306可用于存储本申请的一个或多个实施例提供的语音识别WFST的构建方法,如图5-6所示的语音识别WFST构建方法在终端侧的实现程序。关于本申请的一个或多个实施例提供的语音识别解码方法的实现,请参考图5-11的实施例。
处理器1307可用于读取和执行计算机可读指令。具体的,处理器1307可用于调用存储于第一存储器1305或第二存储器1306中的程序,例如本申请的一个或多个实施例提供的语音识别解码方法在电子终端侧的实现程序,或者,本申请的一个或多个实施例提供的语音识别WFST构建方法在终端侧的实现程序并执行程序包含的指令。
以图8的语音识别解码方法在电子终端侧的实现程序为例，当处理器1307调用存储于第一存储器1305或第二存储器1306中的该语音识别解码方法在电子终端侧的实现程序时，执行以下步骤：
1、从语音信号帧中提取声学特征。
通常接收到的待识别的语音信号可以切割为多个语音信息号帧,解码识别的过程是对语音信号进行声学特征提取。
2、将语音帧对应的声学特征输入声学WFST，获取第一层的声学特征至发音属性的各路径的概率。
3、以第一层的各路径输出的发音属性作为发音WFST网络的输入,获取第二层的发音属性至音素的各路径的概率。
4、以第二层的各路径输出的音素作为词典WFST网络的输入,获取第三层音素至字(或词)的各路径的概率。
5、以第三层的各路径输出的字（或词）作为语言WFST网络的输入，获取第四层字（或词）至词序列的各路径的概率。
6、根据各层中各路径的概率来获得起始帧到最后一帧的声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
图13的设备处理器还可以用于执行图5-6的语音识别WFST的构建方法,图7,9-11的语音识别解码方法在电子终端侧的实现程序。
上述方法已经在上文中详细介绍过,因此不再赘述。
具体当本发明实施例中的方法在终端中实现时,处理器还可以通过芯片的形式来实现。
上述装置通过改进传统语音识别解码器,加入了不受噪声、混响等外界干扰的发音属性特征,并在传统的解码搜索基础上,提出改进的解码搜索方法,利用语音帧属于发音属性的概率,以及发音属性与音素的相关性,提高语音识别系统对环境的鲁棒性。
本领域内的技术人员应明白,本申请实施例可提供为方法、系统、或计算机程序产品。因此,本申请实施例均可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。
显然,本领域的技术人员可以对本申请实施例进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请实施例的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (29)

  1. 一种语音识别加权有限状态转换器WFST的构建方法,其特征在于,所述方法包括:
    构建声学WFST,所述声学WFST是从声学特征到发音属性的搜索网络;
    构建发音WFST,所述发音WFST是从发音属性到音素的搜索网络;
    构建词典WFST,所述词典WFST是从音素到字或词的搜索网络;
    构建语言WFST,所述语言WFST是字或词到词序列的搜索网络;
    对多个WFST进行整合生成语音识别WFST,所述语音识别WFST为从声学特征到词序列的搜索网络;其中所述多个WFST包括:所述声学WFST、所述发音WFST、所述词典WFST、所述语言WFST。
  2. 根据权利要求1所述的方法,其特征在于,构建声学WFST包括:采用HMM隐马尔可夫模型,以发音属性作为状态,以声学特征作为观测,获得发音属性条件下产生给定声学特征的概率;
    基于所述概率构建所述声学WFST。
  3. 根据权利要求1-2任一项所述的方法，其特征在于，构建发音WFST包括：
    以声学特征为输入,以音素和发音属性为双目标输出进行神经网络多目标训练,获得音素与发音属性的共现概率;
    基于所述概率来构建所述发音WFST。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述方法还包括:
    构建第二声学WFST,所述第二声学WFST是从声学特征到音素的搜索网络;
    所述对多个WFST进行整合生成语音识别WFST,其中所述多个WFST包括:所述第二声学WFST;
    对多个WFST进行整合生成语音识别WFST包括:
    将声学WFST和发音WFST进行整合,得到整合结果,所述整合结果为声学特征到音素的搜索网络;
    将所述整合结果和第二声学WFST进行网络合并得到一个合并后的声学特征到音素的搜索网络;
    将所述合并后声学特征到音素的搜索网络和所述词典WFST及所述语言WFST进行整合生成语音识别WFST。
  5. 一种语音识别WFST的构建方法,其特征在于,所述方法包括:
    构建声学加权有限状态转换器WFST,所述声学WFST是从声学特征到发音属性的搜索网络;
    构建发音WFST,所述发音WFST是从发音属性到上下文相关音素的搜索网络;
    构建上下文WFST,所述上下文WFST是从上下文相关音素到音素的搜索网络;
    构建词典WFST,所述词典WFST是从音素到字或词的搜索网络;
    构建语言WFST,所述语言WFST是字或词到词序列的搜索网络;
    对多个WFST进行整合生成语音识别WFST，所述语音识别WFST为从声学特征到词序列的搜索网络；其中所述多个WFST包括：所述声学WFST、所述发音WFST、所述上下文WFST、所述词典WFST、所述语言WFST。
  6. 根据权利要求5所述的方法,其特征在于,构建所述声学WFST包括:采用HMM隐马尔可夫模型,以发音属性作为状态,以声学特征作为观测,获得发音属性条件下产生给定声学特征的概率;
    基于所述概率构建所述声学WFST。
  7. 根据权利要求5-6任一项所述的方法，其特征在于，构建所述发音WFST包括：
    以声学特征为输入,以音素和发音属性为双目标输出进行神经网络多目标训练,获得音素与发音属性的共现概率;
    基于所述概率来构建所述发音WFST。
  8. 根据权利要求5-7任一项所述的方法,其特征在于,所述方法还包括:
    构建第二声学WFST,所述第二声学WFST是从声学特征到上下文相关音素的搜索网络;
    所述对多个WFST进行整合生成语音识别WFST,其中所述多个WFST包括:所述第二声学WFST;
    对多个WFST进行整合生成语音识别WFST包括:
    将所述声学WFST和所述发音WFST进行整合，得到整合结果，所述整合结果为声学特征到上下文相关音素的搜索网络；
    将所述整合结果和第二声学WFST进行网络合并得到一个合并后的声学特征到上下文相关音素的搜索网络；
    将所述合并后的声学特征到上下文相关音素的搜索网络和所述上下文WFST、所述词典WFST及所述语言WFST进行整合生成语音识别WFST。
  9. 一种语音识别解码方法,其特征在于,所述方法包括:
    接收语音信号;
    从所述语音信号中提取声学特征;
    将所述声学特征输入权利要求1-8任一方法中所构建的语音识别WFST,获取声学特征至词序列的各路径的概率;
    比较各路径的概率,概率最大的路径所对应的词序列作为识别结果输出。
  10. 一种语音识别解码方法,其特征在于,所述方法包括:
    接收语音信号;
    从所述语音信号中提取声学特征序列;
    将所述声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
    以所述声学特征至发音属性的各路径输出的发音属性作为发音WFST网络的输入,获取发音属性至音素的各路径的概率;
    以所述发音属性至音素的各路径输出的音素作为词典WFST网络的输入,获取音素至字或词的各路径的概率;
    以音素至字或词的各路径输出的字或词作为语言WFST网络的输入,获取字或词至词序列的各路径的概率;
    根据各WFST网络中各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
  11. 根据权利要求10所述的方法,其特征在于,所述方法还包括:
    所述总概率的计算方法为求和或乘积运算。
  12. 一种语音识别解码方法,其特征在于,所述方法包括:
    接收语音信号;
    从所述语音信号中提取声学特征序列;
    将所述声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
    以所述声学特征序列作为第二声学WFST网络输入,获取声学特征至音素的各路径的概率;
    以声学特征至发音属性的各路径输出的发音属性作为发音WFST网络的输入,获取发音属性至音素的各路径的概率;
    以发音属性至音素的各路径输出的音素和第二声学WFST网络输出的音素作为所述词典WFST网络的输入,获取音素至字或词的各路径的概率;
    以音素至字或词的各路径输出的字或词作为语言WFST网络的输入,获取字或词至词序列的各路径的概率;
    根据各WFST网络中各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
  13. 根据权利要求12所述的方法,其特征在于,所述方法还包括:
    所述总概率的计算方法为求和或乘积运算。
  14. 一种语音识别WFST的构建装置,其特征在于,所述装置包括:处理器和存储器;
    所述处理器用于与所述存储器耦合;并读取存储器中的指令,并根据所述指令执行权利要求1-8中任一项所述的方法。
  15. 一种语音识别解码装置,其特征在于,所述装置包括:处理器和存储器;
    所述处理器用于与所述存储器耦合;并读取存储器中的指令,并根据所述指令执行权利要求9-13中任一项所述的方法。
  16. 一种语音识别解码系统,其特征在于,所述系统包括:终端和服务器;
    所述终端用于接收语音信号,并将所述语音信号发送至服务器;
    所述服务器用于接收所述语音信号，并从语音信号中提取声学特征序列，
    将所述声学特征序列输入权利要求14的语音识别WFST构建装置所构建的语音识别WFST,获取声学特征序列至词序列的各路径的概率;比较各路径的概率,概率最大的路径所对应的词序列作为识别结果输出。
  17. 一种语音识别解码系统,其特征在于,所述系统包括:终端和服务器;
    所述终端用于接收语音信号,并将所述语音信号发送至服务器;
    所述服务器用于接收所述语音信号,并从语音信号中提取声学特征序列,
    将声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路 径的概率;
    以声学特征至发音属性的各路径输出的发音属性作为发音WFST网络的输入,获取发音属性至音素的各路径的概率;
    以发音属性至音素的各路径输出的音素作为词典WFST网络的输入,获取音素至字或词的各路径的概率;
    以音素至字或词的各路径输出的字或词作为语言WFST网络的输入，获取字或词至词序列的各路径的概率；
    根据各WFST网络中各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
  18. 根据权利要求17所述的系统,其特征在于,所述服务器根据各WFST网络中各路径的概率来获得声学特征序列至词序列的各路径的总概率,其中所述总概率的计算方法为求和或乘积运算。
  19. 一种语音识别解码系统,其特征在于,所述系统包括:终端和服务器;
    所述终端用于接收语音信号,并将所述语音信号发送至服务器;
    所述服务器用于接收所述语音信号,并从语音信号中提取声学特征序列,
    将声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
    以声学特征作为第二声学WFST网络输入,获取声学特征至音素的各路径的概率;
    以声学特征至发音属性的各路径输出的发音属性作为发音WFST网络的输入,获取发音属性至音素的各路径的概率;
    以发音属性至音素的各路径输出的音素和第二声学WFST网络输出的音素作为所述词典WFST的输入,获取音素至字或词的各路径的概率;
    以音素至字或词的各路径输出的字或词作为语言WFST网络的输入,获取字或词至词序列的各路径的概率;
    根据各WFST网络中各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
  20. 根据权利要求19所述的系统,其特征在于,所述服务器根据各WFST网络中各路径的概率来获得声学特征序列至词序列的各路径的总概率,其中所述总概率的计算方法为求和或乘积运算。
  21. 一种语音识别WFST的构建装置,其特征在于,所述装置包括:
    声学WFST生成单元,用于生成声学WFST,所述声学WFST是从声学特征到发音属性的搜索网络;
    发音WFST生成单元,用于生成发音WFST,所述发音WFST是从发音属性到音素的搜索网络;
    词典WFST生成单元,用于生成词典WFST,所述词典WFST是从音素到字或词的搜索网络;
    语言WFST生成单元,用于生成语言WFST,所述语言WFST是字或词到词序列的搜索网络;
    解码器生成单元,用于对多个WFST进行整合,生成语音识别WFST;所述整合生成的语音识别WFST为从声学特征到词序列的搜索网络;其中所述多个WFST包括:所述声学WFST生成单元生成的声学WFST,所述发音WFST生成单元生成的发音WFST,所述词典WFST生成单元生成的词典WFST,所述语言WFST生成单元生成的语言WFST。
  22. 根据权利要求21所述的装置,其特征在于,所述装置还包括:
    第二声学WFST生成单元,用于生成第二声学WFST,所述第二声学WFST是从声学特征到音素的搜索网络;
    所述对多个WFST进行整合生成语音识别WFST,其中所述多个WFST包括:第二声学WFST;
    将声学WFST和发音WFST进行整合,得到整合结果,所述整合结果为声学特征到音素的搜索网络;
    将所述整合结果和第二声学WFST进行网络合并得到一个合并后的声学特征到音素的搜索网络;
    将所述合并后声学特征到音素的搜索网络和所述词典WFST及所述语言WFST进行整合生成语音识别WFST。
  23. 一种语音识别WFST的构建装置,其特征在于,所述装置包括:
    声学WFST生成单元,用于生成声学WFST,所述声学WFST是从声学特征到发音属性的搜索网络;
    发音WFST生成单元,用于生成发音WFST,所述发音WFST是从发音属性到上下文相关音素的搜索网络;
    上下文WFST生成单元,用于生成上下文WFST,所述上下文WFST是从上下文相关音素到音素的搜索网络;
    词典WFST生成单元,用于生成词典WFST,所述词典WFST是从音素到字或词的搜索网络;
    语言WFST生成单元,用于生成语言WFST,所述语言WFST是字或词到词序列的搜索网络;
    语音识别WFST生成单元,用于对多个WFST进行整合,生成语音识别WFST;其中所述多个WFST包括:所述声学WFST生成单元生成的声学WFST,所述发音WFST生成单元生成的发音WFST,所述词典WFST生成单元生成的词典WFST,所述语言WFST生成单元生成的语言WFST;
    所述整合生成的语音识别WFST为从声学特征到词序列的搜索网络。
  24. 根据权利要求23所述的装置,其特征在于,所述装置还包括:
    第二声学WFST生成单元,用于生成第二声学WFST,所述第二声学WFST是从声学特征到上下文相关音素的搜索网络;
    所述对多个WFST进行整合生成语音识别WFST,其中所述多个WFST包括:第二声学WFST;
    将声学WFST和发音WFST进行整合，得到整合结果，所述整合结果为声学特征到上下文相关音素的搜索网络；
    将所述整合结果和第二声学WFST进行网络合并得到一个合并后的声学特征到上下文相关音素的搜索网络；
    将所述合并后的声学特征到上下文相关音素的搜索网络和所述上下文WFST、所述词典WFST及所述语言WFST进行整合生成语音识别WFST。
  25. 一种语音识别解码装置,其特征在于,所述装置包括:
    语音信号接收单元,用于接收语音信号;
    声学特征提取单元，用于从所述语音信号接收单元接收的语音信号中提取声学特征序列；
    语音识别解码单元,用于将声学特征序列输入权利要求14、21-24任一项所述的语音识别WFST构建装置所构建的语音识别WFST,获取声学特征序列至词序列的各路径的概率;比较各路径的概率,概率最大的路径作为识别结果输出。
  26. 一种语音识别解码装置,其特征在于,所述装置包括:
    语音信号接收单元,用于接收语音信号;
    声学特征提取单元，用于从所述语音信号接收单元接收的语音信号中提取声学特征序列；
    第一获取单元,用于将声学特征提取单元提取的声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
    第二获取单元，用于将所述第一获取单元获取的各路径的发音属性输入发音WFST网络，获取发音属性至音素的各路径的概率；
    第三获取单元，用于将第二获取单元获取的各路径的音素作为词典WFST网络的输入，获取音素至字或词的各路径的概率；
    第四获取单元，用于将第三获取单元获取的各路径的字或词输入语言WFST网络，获取字或词至词序列的各路径的概率；
    结果输出单元,用于根据各个获取单元获取的各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
  27. 根据权利要求26所述的装置,其特征在于,所述结果输出单元根据各个获取单元获取的各路径的概率来获得声学特征序列至词序列的各路径的总概率,其中所述总概率的计算方法为求和或乘积运算。
  28. 一种语音识别解码装置,其特征在于,所述装置包括:
    语音信号接收单元,用于接收语音信号;
    声学特征提取单元，用于从所述语音信号接收单元接收的语音信号中提取声学特征序列；
    第一获取单元,用于将声学特征提取单元提取的声学特征序列顺序输入声学WFST网络,获取声学特征至发音属性的各路径的概率;
    第二获取单元,将声学特征序列顺序作为第二声学WFST网络输入,获取声学特征序列至音素的各路径的概率;
    第三获取单元,将第一获取单元获取的各路径输出的发音属性输入发音WFST网络,获取发音属性至音素的各路径的概率;
    第四获取单元,将第二获取单元获取的各路径输出的音素和第三获取单元获取的各路径输出的音素输入词典WFST网络,获取音素至字或词的各路径的概率;
    第五获取单元将第四获取单元获取的各路径输出的字或词输入语言WFST网络,获取字或词至词序列的各路径的概率;
    结果输出单元,用于根据各个获取单元获取的各路径的概率来获得声学特征序列至词序列的各路径的总概率,将总概率最大的路径所对应的词序列作为对应于所述声学特征序列的识别结果输出。
  29. 根据权利要求28所述的装置,其特征在于,所述结果输出单元根据各个获取单元获取的各路径的概率来获得声学特征序列至词序列的各路径的总概率,其中所述总概率的计算方法为求和或乘积运算。