WO2017101450A1 - Speech recognition method and device - Google Patents

Speech recognition method and device

Info

Publication number
WO2017101450A1
WO2017101450A1 (PCT/CN2016/091765)
Authority
WO
WIPO (PCT)
Prior art keywords: unit, decoding, path, score, blank
Application number: PCT/CN2016/091765
Other languages
English (en)
French (fr)
Inventor
钱胜
潘复平
Original Assignee
百度在线网络技术(北京)有限公司
Application filed by 百度在线网络技术(北京)有限公司
Priority to US15/758,159 (US10650809B2)
Publication of WO2017101450A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L2015/0631: Creating reference templates; Clustering
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/26: Speech to text systems
    • G10L19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Definitions

  • the present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus.
  • Traditionally, speech recognition is performed with models based on state modeling, for example based on the Hidden Markov Model (HMM).
  • An HMM can be regarded as a double stochastic process: a Markov chain with a finite number of states models the hidden stochastic process underlying the changing statistical characteristics of the speech signal, while a second stochastic process generates the observation sequence associated with each state of the Markov chain.
  • In this modeling approach, a phoneme or a syllable is considered divisible into multiple states without physical meaning, and a discrete or continuous Gaussian model or a deep learning model is then used to describe the output distribution of each state.
  • However, with state-based modeling, confusion easily arises when recognizing the region between two pronunciation units, and recognition performance is poor.
  • the present invention aims to solve the above technical problems at least to some extent.
  • a first object of the present invention is to provide a speech recognition method capable of improving the accuracy of speech recognition and improving the decoding speed in the recognition process.
  • a second object of the present invention is to provide a speech recognition apparatus.
  • According to a first aspect, a speech recognition method includes the steps of: receiving a speech signal; decoding the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and dynamically adding blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and the blank unit, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units; and outputting the optimal decoding path as the recognition result of the speech signal.
  • This speech recognition method decodes the speech signal based on an acoustic model built with connectionist temporal classification and a decoding network, and dynamically adds blank units during decoding to obtain an optimal decoding path after the blank units are added, which is output as the recognition result of the speech signal. It can solve the problem of confusion in the middle of two pronunciation units, improve the accuracy of speech recognition, effectively reduce the number of possible decoding paths, and increase the decoding speed during recognition.
  • A second aspect of the present invention provides a speech recognition device, including: a receiving module configured to receive a speech signal; a decoding module configured to decode the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and to dynamically add blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and the blank unit, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units; and an output module configured to output the optimal decoding path as the recognition result of the speech signal.
  • The speech recognition device of the embodiments of the present invention decodes the speech signal based on an acoustic model built with connectionist temporal classification and a decoding network, and dynamically adds blank units during decoding to obtain an optimal decoding path after the blank units are added, which is output as the recognition result of the speech signal. It can solve the problem of confusion in the middle of two pronunciation units, improve the accuracy of speech recognition, effectively reduce the number of possible decoding paths, and increase the decoding speed during recognition.
  • An embodiment of the third aspect of the present invention provides an electronic device, including: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform the speech recognition method of the embodiment of the first aspect of the present invention.
  • A fourth aspect of the present invention provides a non-volatile computer storage medium storing one or more programs which, when executed by a device, cause the device to perform the speech recognition method according to an embodiment of the first aspect of the present invention.
  • FIG. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a decoding network according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a speech recognition method according to another embodiment of the present invention.
  • FIG. 4a is a schematic diagram of a node S in a decoding network according to an embodiment of the present invention.
  • FIG. 4b is a topology diagram after a blank node is added to the node S of FIG. 4a according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of recognition confusion in the middle of two pronunciation units in a speech recognition method according to an embodiment of the present invention.
  • FIG. 6 is a first schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
  • FIG. 7 is a second schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
  • FIG. 8 is a third schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
  • FIG. 9 is a fourth schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
  • A speech recognition method includes the steps of: receiving a speech signal; decoding the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and dynamically adding blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units; and outputting the optimal decoding path as the recognition result of the speech signal.
  • FIG. 1 is a flow chart of a speech recognition method in accordance with one embodiment of the present invention.
  • a speech recognition method includes the following steps.
  • The acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units.
  • The pre-established acoustic model is trained with the CTC (connectionist temporal classification) technique. Specifically, feature extraction may be performed on a large number of speech signals to obtain feature vectors of the speech signals; blank labels are then added to the feature vectors every predetermined number of pronunciation units, and the speech signals with the added blank labels are trained using connectionist temporal classification to build the acoustic model.
  • The acoustic model includes multiple basic pronunciation units and blank units.
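The patent discloses no code; purely as an illustration, the following is a minimal sketch of how an acoustic model of this kind might be trained with PyTorch's nn.CTCLoss. The feature dimension, network shape, unit inventory size, and the choice of id 0 for the blank unit are all assumptions made for the example.

```python
# Illustrative sketch only (not from the patent): CTC acoustic-model training.
import torch
import torch.nn as nn

NUM_UNITS = 100                 # hypothetical inventory of basic pronunciation units
BLANK_ID = 0                    # reserve id 0 for the blank unit

class AcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=3)
        self.out = nn.Linear(hidden, NUM_UNITS + 1)     # units + blank

    def forward(self, feats):                           # feats: (T, N, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h).log_softmax(dim=-1)          # per-frame log-probabilities

model = AcousticModel()
ctc_loss = nn.CTCLoss(blank=BLANK_ID)

feats = torch.randn(200, 8, 40)                         # 200 frames, batch of 8
targets = torch.randint(1, NUM_UNITS + 1, (8, 30))      # pronunciation-unit label sequences
loss = ctc_loss(model(feats), targets,
                input_lengths=torch.full((8,), 200),
                target_lengths=torch.full((8,), 30))
loss.backward()                                         # gradients for one training step
```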
  • The language model may be any existing language model or any language model that may appear in the future; the present invention does not limit this.
  • The multiple basic pronunciation units in the acoustic model and the jump relationships (i.e., jump paths) between them can form a large number of decoding paths, and these decoding paths constitute the decoding network.
  • The basic pronunciation unit may be a complete initial or final, which may be called a phoneme.
  • For example, FIG. 2 is a schematic diagram of a decoding network according to an embodiment of the present invention. In FIG. 2, a dotted circle identifies the start of a decoding path, solid circles (such as A and B) represent basic pronunciation units in the decoding network, and arrows identify the jump paths between basic pronunciation units. As can be seen, the decoding network contains multiple decoding paths, each of which is one possible decoding result for the speech signal.
  • The process of decoding the speech signal is the process of selecting an optimal decoding path from the multiple decoding paths in the decoding network according to the feature vector frames of the speech signal.
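To make the structure concrete, here is a hypothetical toy version of such a decoding network in Python, loosely mirroring FIG. 2; the unit names and the start/end markers are invented for the example.

```python
# Hypothetical toy decoding network: nodes are basic pronunciation units,
# directed edges are jump paths. "<s>" and "</s>" are invented markers for
# the start and end positions of the network.
DECODING_NETWORK = {
    "<s>": ["A", "B"],          # the dotted start circle can jump to A or B
    "A":   ["A", "B", "</s>"],  # self-loop plus jumps onward
    "B":   ["A", "B", "</s>"],
}

def jump_paths(unit):
    """Units reachable from `unit` in one jump."""
    return DECODING_NETWORK.get(unit, [])
```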
  • S102 may specifically include S201-S204:
  • Extending a decoding path is the process of advancing step by step from the start position in the decoding network, along the jump paths between the basic pronunciation units, toward the end position of the decoding network.
  • For example, if extension of the speech signal up to feature vector frame i has been completed and at least one decoding path (which may be called a current decoding path) has been obtained, and feature vector frame i corresponds to basic pronunciation unit A in one of the current decoding paths, the current decoding path may be further extended along each jump path of basic pronunciation unit A in the decoding network to obtain possible extended paths.
  • Each forward step in the decoding network represents one possible jump of feature vector frame i in the speech signal to feature vector frame i+1.
  • As path extension proceeds and the extension reaches a basic pronunciation unit, a blank unit may be added for that basic pronunciation unit, together with the jump paths related to the blank unit.
  • Specifically, the first basic pronunciation unit to which each decoding path has currently been extended may be determined; a jump path from the first basic pronunciation unit to the blank unit and a jump path from the blank unit to itself are added for the first basic pronunciation unit, so as to generate at least one extended path after the blank unit is added for the first basic pronunciation unit.
  • For example, for the node S of FIG. 4a, the topology after the blank unit is added may be as shown in FIG. 4b: on the basis of the original S->S path (i.e., the jump from S to S), the paths S->blank and blank->blank are added. This is equivalent to adding, on top of the jump paths in the decoding network, the blank-related jump paths for a basic pronunciation unit when the extension reaches that unit, and extending the current decoding path according to the added jump paths.
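A minimal sketch of this dynamic blank expansion, under the assumption that a decoding path is simply a list of unit names and that "blank" marks the blank unit (the real decoder's data structures are not disclosed):

```python
# Sketch of the FIG. 4a/4b expansion: on entering unit S, the next frame may
# stay in S (S->S), enter blank (S->blank), stay in blank (blank->blank), or
# exit to a following unit of the network.
def expand_one_frame(path, network):
    unit = path[-1]
    extensions = []
    if unit == "blank":
        # the unit this blank was dynamically added for
        prev = next(u for u in reversed(path) if u != "blank")
        extensions.append(path + ["blank"])            # blank -> blank
        for nxt in network.get(prev, []):              # blank -> exit
            if nxt != prev:
                extensions.append(path + [nxt])
    else:
        extensions.append(path + [unit])               # S -> S (self-loop)
        extensions.append(path + ["blank"])            # S -> blank (added here)
        for nxt in network.get(unit, []):              # S -> next unit directly
            if nxt != unit:
                extensions.append(path + [nxt])
    return extensions
```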
  • The blank unit represents a non-pronunciation unit and can be used to identify pauses between phonemes and between words in a speech signal.
  • By adding a blank unit for each pronunciation unit, embodiments of the present invention better solve the frame classification problem at the confusable region in the middle of two pronunciation units.
  • Traditional "forced alignment" generally classifies the region in the middle of two pronunciation units as the left label, the right label, or a short pause, which easily leads to inaccurate recognition of that region and to confusion.
  • FIG. 5 is a schematic diagram of recognition confusion in the middle of two pronunciation units in a speech recognition method according to an embodiment of the present invention. As can be seen from FIG. 5, with the approach of adding blank units no such confusion occurs, and the accuracy of speech recognition can be improved.
  • In addition, embodiments of the present invention add blank units dynamically during extension; that is, only when the extension reaches a basic pronunciation unit are the blank-related jump paths added at that unit, and the jump paths of the basic pronunciation unit are merged with the blank-related jump paths. This effectively reduces the number of possible decoding paths and accelerates decoding.
  • The score of each possible extended path may be determined on the acoustic model and the language model according to feature vector frame i+1. The possible extended paths may subsequently be screened according to the scores to obtain the decoding paths corresponding to the speech signal when feature vector frame i+1 is reached (S203).
  • The score of an extended path is the sum of the acoustic model scores and the language model scores of the basic pronunciation units on the extended path.
  • For example, if the extended path jumps from basic pronunciation unit A to basic pronunciation unit B, the acoustic model score of B is obtained from the acoustic model, the language model score of B is obtained from the language model, and both are accumulated onto the score of the decoding path before it was extended to B, yielding the score of the extended path.
  • Obtaining the acoustic model score and the language model score of a basic pronunciation unit is the same as in the prior art and is not described in detail here.
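In code form, the accumulation described above might look like the following sketch, where `acoustic_score(unit, frame)` and `lm_score(unit)` stand in for the real acoustic-model and language-model lookups, which the patent does not spell out:

```python
# Sketch of the score bookkeeping: extending a path into `new_unit` adds that
# unit's acoustic score for the current frame and its language-model score
# onto the score accumulated before the extension.
def extended_path_score(path_score, new_unit, frame, acoustic_score, lm_score):
    return path_score + acoustic_score(new_unit, frame) + lm_score(new_unit)
```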
  • Compared with the decoding path before the update, the updated current decoding path has one more unit node corresponding to feature vector frame i+1 (which may be a basic pronunciation unit or a blank unit).
  • For example, a preset number of extended paths with higher scores may be selected as the new current decoding paths.
  • Alternatively, the difference between the score of each of the at least one extended path and the highest score among the current decoding paths may be obtained; if the difference between the score of an extended path and the highest score is smaller than a preset threshold, the extended path is taken as a new current decoding path.
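A hedged sketch of this difference-to-best screening rule, assuming extended paths are kept as (score, path) pairs for the current frame:

```python
# Keep a path only if it lies within `threshold` of the best-scoring path.
def prune(paths, threshold):
    best = max(score for score, _ in paths)
    return [(s, p) for s, p in paths if best - s < threshold]
```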
  • The acoustic model trained with the CTC technique has a typical spike phenomenon in its scores: when a feature vector frame of the speech signal is located at a certain basic pronunciation unit, the acoustic model score of that unit for the frame is significantly higher than the scores of the other units, while for feature vector frames not located at any basic pronunciation unit, the blank unit scores significantly higher than the other units. In other words, if the blank unit has the highest score for a certain feature vector frame, that frame is not located at any basic pronunciation unit.
  • A pruning strategy may therefore be formulated based on this spike phenomenon, according to the scores of a basic pronunciation unit on an extended path and of the blank unit corresponding to that unit.
  • Specifically, the score of the blank unit and the score of the first basic pronunciation unit may be obtained according to the current feature vector frame; if the score of the first basic pronunciation unit is smaller than the score of the blank unit, the preset threshold is lowered when judging whether an extended path entering the first basic pronunciation unit can serve as a new current decoding path.
  • Here, a score is the sum of the language model score and the acoustic model score.
  • For example, the score of the current feature vector frame (i.e., feature vector frame i+1) at A and its score at blank may be obtained. If the score of the current frame at A is smaller than its score at blank, there are two possibilities: either the current frame should be at blank, or the current frame is at a unit scoring higher than blank. Therefore, when judging whether an extended path entering basic pronunciation unit A can serve as a new current decoding path, the pruning threshold should be narrowed, i.e., the preset threshold reduced, so that extended paths entering basic pronunciation unit A are pruned more strictly. The number of extended paths can thereby be reduced and the decoding speed increased.
  • When a decoding path reaches the end of a word, the actual language model score of the path needs to be queried. Therefore, when judging whether an extended path that reaches a word end can serve as the new current decoding path, the preset threshold is lowered; word-end extended paths are pruned more strictly, which reduces the number of extended paths and hence the number of language model score queries, further increasing the decoding speed.
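The two threshold adjustments just described might be combined as in the following sketch; the patent does not give the tightening amount, so `tighten` is an arbitrary illustrative value, and `acoustic_score` is the same stand-in as before:

```python
# Adaptive pruning threshold: tightened when the frame scores higher on blank
# than on the entered unit (CTC spike heuristic), and tightened again at word
# ends, where a language-model query would otherwise be needed.
def effective_threshold(base, unit, frame, acoustic_score,
                        at_word_end, tighten=0.5):
    t = base
    if acoustic_score(unit, frame) < acoustic_score("blank", frame):
        t -= tighten      # prune entry into this unit more strictly
    if at_word_end:
        t -= tighten      # prune word-end extensions more strictly
    return max(t, 0.0)
```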
  • If the current feature vector frame is the last feature vector frame of the speech signal, path extension has been completed, so an optimal decoding path can be selected from all the decoding paths obtained.
  • Specifically, the decoding path with the highest score may be selected from the current decoding paths as the optimal decoding path according to the score of each decoding path.
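Tying the sketches above together, a hypothetical skeleton of the S201-S204 loop could look like this; it illustrates the flow described in the text, not the patented implementation:

```python
# Reuses expand_one_frame, extended_path_score, and prune from the sketches above.
def decode(frames, network, acoustic_score, lm_score, threshold):
    paths = [(0.0, ["<s>"])]
    for frame in frames:                       # one feature vector frame at a time
        candidates = []
        for score, path in paths:
            for ext in expand_one_frame(path, network):            # S201
                s = extended_path_score(score, ext[-1], frame,
                                        acoustic_score, lm_score)  # S202
                candidates.append((s, ext))
        paths = prune(candidates, threshold)   # S203
    return max(paths)[1]                       # S204: highest-scoring path
```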
  • The speech recognition method of the embodiments of the present invention decodes the speech signal based on an acoustic model built with connectionist temporal classification and a decoding network, dynamically adds blank units during decoding to obtain an optimal decoding path after the blank units are added, and outputs it as the recognition result of the speech signal. This solves the problem of confusion in the middle of two pronunciation units, improves the accuracy of speech recognition, effectively reduces the number of possible decoding paths, and increases the decoding speed during recognition.
  • To implement the above embodiments, the present invention further proposes a speech recognition device.
  • A speech recognition device includes: a receiving module configured to receive a speech signal; a decoding module configured to decode the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and to dynamically add blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units; and an output module configured to output the optimal decoding path as the recognition result of the speech signal.
  • FIG. 6 is a first schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
  • The speech recognition device includes a receiving module 10, a decoding module 20, and an output module 30.
  • The receiving module 10 is configured to receive a speech signal.
  • The decoding module 20 is configured to decode the speech signal according to the pre-established acoustic model, language model, and decoding network, and to dynamically add blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units.
  • The pre-established acoustic model is trained with the CTC (connectionist temporal classification) technique. Specifically, feature extraction may be performed on a large number of speech signals to obtain feature vectors of the speech signals; blank labels are then added to the feature vectors every predetermined number of pronunciation units, and the speech signals with the added blank labels are trained using connectionist temporal classification to build the acoustic model.
  • The acoustic model includes multiple basic pronunciation units and blank units.
  • The language model may be any existing language model or any language model that may appear in the future; the present invention does not limit this.
  • The multiple basic pronunciation units in the acoustic model and the jump relationships between them can form a large number of decoding paths, and these decoding paths constitute the decoding network.
  • The basic pronunciation unit may be a complete initial or final, which may be called a phoneme.
  • For example, FIG. 2 is a schematic diagram of a decoding network according to an embodiment of the present invention. In FIG. 2, a dotted circle identifies the start of a decoding path, solid circles (such as A and B) represent basic pronunciation units in the decoding network, and arrows identify the jump paths between basic pronunciation units. As can be seen, the decoding network contains multiple decoding paths, each of which is one possible decoding result for the speech signal.
  • The process of decoding the speech signal is the process of selecting an optimal decoding path from the multiple decoding paths in the decoding network according to the feature vector frames of the speech signal.
  • the decoding module 20 may specifically include: an expansion unit 21, an adding unit 22, a first obtaining unit 23, a screening unit 24, and a selecting unit 25.
  • the extension unit 21 is configured to expand each of the current decoding paths according to the jump path in the decoding network.
  • The process by which the extension unit 21 extends a decoding path is the process of advancing step by step from the start position in the decoding network, along the jump paths between the basic pronunciation units, toward the end position of the decoding network.
  • For example, the extension unit 21 may further extend the current decoding path along each jump path of basic pronunciation unit A in the decoding network to obtain possible extended paths.
  • Each forward step in the decoding network represents one possible jump of feature vector frame i in the speech signal to feature vector frame i+1.
  • The adding unit 22 is configured to dynamically add blank units during the extension to obtain at least one extended path after the blank units are added.
  • When the extension reaches a basic pronunciation unit, the adding unit 22 may add a blank unit for that basic pronunciation unit, together with the jump paths related to the blank unit.
  • Specifically, the adding unit 22 may be configured to: determine the first basic pronunciation unit to which each decoding path has currently been extended; and add, for the first basic pronunciation unit, a jump path from the first basic pronunciation unit to the blank unit and a jump path from the blank unit to itself, so as to generate at least one extended path after the blank unit is added for the first basic pronunciation unit.
  • For example, for the node S of FIG. 4a, the topology after the blank unit is added may be as shown in FIG. 4b: on the basis of the original S->S path (i.e., the jump from S to S), the paths S->blank and blank->blank are added. This is equivalent to adding, on top of the jump paths in the decoding network, the blank-related jump paths for a basic pronunciation unit when the extension reaches that unit, and extending the current decoding path according to the added jump paths.
  • The blank unit represents a non-pronunciation unit and can be used to identify pauses between phonemes and between words in a speech signal.
  • By adding a blank unit for each pronunciation unit, embodiments of the present invention better solve the frame classification problem at the confusable region in the middle of two pronunciation units.
  • Traditional "forced alignment" generally classifies the region in the middle of two pronunciation units as the left label, the right label, or a short pause, which easily leads to inaccurate recognition of that region and to confusion.
  • FIG. 5 is a schematic diagram of recognition confusion in the middle of two pronunciation units in a speech recognition method according to an embodiment of the present invention. As can be seen from FIG. 5, with the approach of adding blank units no such confusion occurs, and the accuracy of speech recognition can be improved.
  • In addition, embodiments of the present invention add blank units dynamically during extension; that is, only when the extension reaches a basic pronunciation unit are the blank-related jump paths added at that unit, and the jump paths of the basic pronunciation unit are merged with the blank-related jump paths. This effectively reduces the number of possible decoding paths and accelerates decoding.
  • The first obtaining unit 23 is configured to obtain the scores of the at least one extended path on the acoustic model and the language model according to the current feature vector frame extracted from the speech signal.
  • For example, the first obtaining unit 23 may determine the score of each possible extended path on the acoustic model and the language model according to feature vector frame i+1.
  • The screening unit 24 may subsequently screen the possible extended paths according to the scores to obtain the decoding paths corresponding to the speech signal when feature vector frame i+1 is reached.
  • The score of an extended path is the sum of the acoustic model scores and the language model scores of the basic pronunciation units on the extended path.
  • For example, if the extended path jumps from basic pronunciation unit A to basic pronunciation unit B, the first obtaining unit 23 may obtain the acoustic model score of B from the acoustic model and the language model score of B from the language model, and accumulate both onto the score of the decoding path before it was extended to B.
  • Obtaining the acoustic model score and the language model score of a basic pronunciation unit is the same as in the prior art and is not described in detail here.
  • the screening unit 24 is configured to filter the at least one extended path according to the score, and update the current decoding path according to the screening result.
  • Compared with the decoding path before the update, the updated current decoding path has one more unit node corresponding to feature vector frame i+1 (which may be a basic pronunciation unit or a blank unit).
  • The screening unit 24 may screen extended paths according to the scores in multiple ways. For example, the screening unit 24 may select a preset number of extended paths with higher scores as the new current decoding paths.
  • The screening unit 24 may also be configured to: obtain the difference between the score of each of the at least one extended path and the highest score among the current decoding paths; and, if the difference between the score of an extended path and the highest score is smaller than a preset threshold, take the extended path as a new current decoding path.
  • The selecting unit 25 is configured to select, if the current feature vector frame is the last feature vector frame of the speech signal, the optimal decoding path from the updated current decoding paths according to the scores.
  • If the current feature vector frame is the last feature vector frame, path extension has been completed, so the selecting unit 25 can select one optimal decoding path from all the decoding paths. Specifically, the selecting unit 25 may select the decoding path with the highest score from the current decoding paths as the optimal decoding path according to the score of each decoding path.
  • the output module 30 is configured to output the optimal decoding path as the recognition result of the speech signal.
  • The speech recognition device of the embodiments of the present invention decodes the speech signal based on an acoustic model built with connectionist temporal classification and a decoding network, dynamically adds blank units during decoding to obtain an optimal decoding path after the blank units are added, and outputs it as the recognition result of the speech signal. This solves the problem of confusion in the middle of two pronunciation units, improves the accuracy of speech recognition, effectively reduces the number of possible decoding paths, and increases the decoding speed during recognition.
  • The acoustic model trained with the CTC technique has a typical spike phenomenon in its scores: when a feature vector frame of the speech signal is located at a certain basic pronunciation unit, the acoustic model score of that unit for the frame is significantly higher than the scores of the other units, while for feature vector frames not located at any basic pronunciation unit, the blank unit scores significantly higher than the other units. In other words, if the blank unit has the highest score for a certain feature vector frame, that frame is not located at any basic pronunciation unit.
  • A pruning strategy may therefore be formulated based on this spike phenomenon, according to the scores of a basic pronunciation unit on an extended path and of the blank unit corresponding to that unit.
  • FIG. 8 is a third schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
  • As shown in FIG. 8, the decoding module 20 may further include a second obtaining unit 26 and a first control unit 27.
  • The second obtaining unit 26 is configured to obtain the score of the blank unit and the score of the first basic pronunciation unit according to the current feature vector frame.
  • The first control unit 27 is configured to, when the score of the blank unit is smaller than the score of the first basic pronunciation unit, lower the preset threshold when judging whether an extended path entering the first basic pronunciation unit can serve as a new current decoding path.
  • Here, a score is the sum of the language model score and the acoustic model score.
  • For example, the score of the current feature vector frame (i.e., feature vector frame i+1) at A and its score at blank may be obtained. If the score of the current frame at A is smaller than its score at blank, there are two possibilities: either the current frame should be at blank, or the current frame is at a unit scoring higher than blank. Therefore, when judging whether an extended path entering basic pronunciation unit A can serve as a new current decoding path, the pruning threshold should be narrowed, i.e., the preset threshold reduced, so that extended paths entering basic pronunciation unit A are pruned more strictly. The number of extended paths can thereby be reduced and the decoding speed increased.
  • FIG. 9 is a fourth schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
  • As shown in FIG. 9, the decoding module 20 may further include a judging unit 28 and a second control unit 29.
  • The judging unit 28 is configured to judge whether the at least one extended path reaches the end of a word.
  • The second control unit 29 is configured to, when an extended path reaches the end of a word, lower the preset threshold when judging whether the extended path reaching the word end can serve as a new current decoding path.
  • When a decoding path reaches the end of a word, the actual language model score of the path needs to be queried. Therefore, when judging whether an extended path that reaches a word end can serve as the new current decoding path, the preset threshold is lowered; word-end extended paths are pruned more strictly, which reduces the number of extended paths and hence the number of language model score queries, further increasing the decoding speed.
  • The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
  • A feature defined with "first" or "second" may explicitly or implicitly include at least one such feature.
  • "A plurality of" means two or more, unless specifically and explicitly defined otherwise.
  • A "computer-readable medium" may be any apparatus that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, an instruction execution system, apparatus, or device.
  • More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM).
  • The computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.
  • Portions of the invention may be implemented in hardware, software, firmware, or a combination thereof.
  • Multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system.
  • For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques well known in the art: discrete logic circuits with logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gate circuits, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
  • each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Abstract

A speech recognition method and device, the method including: receiving a speech signal (S101); decoding the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and dynamically adding blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units (S102); and outputting the optimal decoding path as the recognition result of the speech signal (S103). The speech recognition method can improve the accuracy of speech recognition and increase the decoding speed during recognition.

Description

Speech recognition method and device

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to Chinese Patent Application No. 201510925644.6, entitled "Speech recognition method and device", filed on December 14, 2015 by 百度在线网络技术(北京)有限公司 (Baidu Online Network Technology (Beijing) Co., Ltd.).

TECHNICAL FIELD

The present invention relates to the field of speech recognition technologies, and in particular to a speech recognition method and device.

BACKGROUND

Traditional speech recognition technology mostly performs speech recognition with models based on state modeling, for example based on the Hidden Markov Model (HMM). An HMM can be regarded as a double stochastic process: a Markov chain with a finite number of states models the hidden stochastic process underlying the changing statistical characteristics of the speech signal, and a second stochastic process generates the observation sequence associated with each state of the Markov chain. In this modeling approach, a phoneme or a syllable is considered divisible into multiple states without physical meaning, and a discrete or continuous Gaussian model or a deep learning model is then used to describe the output distribution of each state. However, with state-based modeling, confusion easily arises when recognizing the region between two pronunciation units, and recognition performance is poor.
SUMMARY

The present invention aims to solve the above technical problems at least to some extent.

To this end, a first objective of the present invention is to propose a speech recognition method capable of improving the accuracy of speech recognition and increasing the decoding speed during recognition.

A second objective of the present invention is to propose a speech recognition device.

To achieve the above objectives, an embodiment of the first aspect of the present invention proposes a speech recognition method, including the following steps: receiving a speech signal; decoding the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and dynamically adding blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and the blank unit, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units; and outputting the optimal decoding path as the recognition result of the speech signal.

The speech recognition method of the embodiments of the present invention decodes the speech signal based on an acoustic model built with connectionist temporal classification and a decoding network, and dynamically adds blank units during decoding to obtain an optimal decoding path after the blank units are added, which is output as the recognition result of the speech signal. This solves the problem of confusion in the middle of two pronunciation units, improves the accuracy of speech recognition, effectively reduces the number of possible decoding paths, and increases the decoding speed during recognition.

An embodiment of the second aspect of the present invention proposes a speech recognition device, including: a receiving module configured to receive a speech signal; a decoding module configured to decode the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and to dynamically add blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and the blank unit, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units; and an output module configured to output the optimal decoding path as the recognition result of the speech signal.

The speech recognition device of the embodiments of the present invention decodes the speech signal based on an acoustic model built with connectionist temporal classification and a decoding network, and dynamically adds blank units during decoding to obtain an optimal decoding path after the blank units are added, which is output as the recognition result of the speech signal. This solves the problem of confusion in the middle of two pronunciation units, improves the accuracy of speech recognition, effectively reduces the number of possible decoding paths, and increases the decoding speed during recognition.

An embodiment of the third aspect of the present invention provides an electronic device, including: one or more processors; a memory; and one or more programs stored in the memory which, when executed by the one or more processors, perform the speech recognition method of the embodiment of the first aspect of the present invention.

An embodiment of the fourth aspect of the present invention provides a non-volatile computer storage medium storing one or more programs which, when executed by a device, cause the device to perform the speech recognition method of the embodiment of the first aspect of the present invention.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned by practice of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a decoding network according to an embodiment of the present invention;

FIG. 3 is a flowchart of a speech recognition method according to another embodiment of the present invention;

FIG. 4a is a schematic diagram of a node S in a decoding network according to an embodiment of the present invention;

FIG. 4b is a topology diagram after a blank node is added to the node S of FIG. 4a according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of recognition confusion in the middle of two pronunciation units in a speech recognition method according to an embodiment of the present invention;

FIG. 6 is a first schematic structural diagram of a speech recognition device according to an embodiment of the present invention;

FIG. 7 is a second schematic structural diagram of a speech recognition device according to an embodiment of the present invention;

FIG. 8 is a third schematic structural diagram of a speech recognition device according to an embodiment of the present invention;

FIG. 9 is a fourth schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
DETAILED DESCRIPTION

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, it should be understood that the term "a plurality of" means two or more, and the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.

A speech recognition method and device according to embodiments of the present invention are described below with reference to the accompanying drawings.

A speech recognition method includes the following steps: receiving a speech signal; decoding the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and dynamically adding blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units; and outputting the optimal decoding path as the recognition result of the speech signal.

FIG. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention.

As shown in FIG. 1, the speech recognition method according to an embodiment of the present invention includes the following steps.

S101: receive a speech signal.

S102: decode the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and dynamically add blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units.

In an embodiment of the present invention, the pre-established acoustic model is trained with the CTC (connectionist temporal classification) technique. Specifically, feature extraction may be performed on a large number of speech signals to obtain feature vectors of the speech signals. Blank labels are then added to the feature vectors every predetermined number of pronunciation units, and the speech signals with the added blank labels are trained using connectionist temporal classification to build the acoustic model. The acoustic model includes multiple basic pronunciation units and blank units.

The language model may be any existing language model or any language model that may appear in the future; the present invention does not limit this.

The multiple basic pronunciation units in the acoustic model and the jump relationships (i.e., jump paths) between them can form a large number of decoding paths, and these decoding paths constitute the decoding network. The basic pronunciation unit may be a complete initial or final, which may be called a phoneme.

For example, FIG. 2 is a schematic diagram of a decoding network according to an embodiment of the present invention. As shown in FIG. 2, a dotted circle identifies the start of a decoding path, solid circles (such as A and B) represent basic pronunciation units in the decoding network, and arrows identify the jump paths between basic pronunciation units. As can be seen from FIG. 2, there are multiple decoding paths in the decoding network, and each decoding path is one possible decoding result for the speech signal.
In embodiments of the present invention, decoding the speech signal is the process of selecting the optimal decoding path from the multiple decoding paths in the decoding network according to the feature vector frames of the speech signal.

In an embodiment of the present invention, as shown in FIG. 3, S102 may specifically include S201-S204.

S201: extend each current decoding path according to the jump paths in the decoding network, and dynamically add blank units during the extension to obtain at least one extended path after the blank units are added.

Extending a decoding path is the process of advancing step by step from the start position in the decoding network, along the jump paths between the basic pronunciation units, toward the end position of the decoding network.

For example, if extension of the speech signal up to feature vector frame i has been completed and at least one decoding path (which may be called a current decoding path) has been obtained, and feature vector frame i corresponds to basic pronunciation unit A in one of the current decoding paths, the current decoding path may be further extended along each jump path of basic pronunciation unit A in the decoding network to obtain possible extended paths. Each forward step in the decoding network represents one possible jump of feature vector frame i in the speech signal to feature vector frame i+1.

In embodiments of the present invention, as path extension proceeds and the extension reaches a basic pronunciation unit, a blank unit may be added for that basic pronunciation unit, together with the jump paths related to the blank unit. Specifically, the first basic pronunciation unit to which each decoding path has currently been extended is determined; for the first basic pronunciation unit, a jump path from the first basic pronunciation unit to the blank unit and a jump path from the blank unit to itself are added, so as to generate at least one extended path after the blank unit is added for the first basic pronunciation unit.

For example, for the node S in the decoding network of FIG. 4a, the topology after the blank unit is added may be as shown in FIG. 4b: on the basis of the original S->S path (i.e., the jump from S to S), the paths S->blank and blank->blank are added. This is equivalent to adding, on top of the jump paths in the decoding network, the blank-related jump paths for a basic pronunciation unit when the extension reaches that unit, and extending the current decoding path according to the added jump paths.

Thus, after entering S in a decoding path, the possible extended paths "S->S (repeatable, zero or more times), S->blank, blank->blank (repeatable, zero or more times), blank->exit (the basic pronunciation unit following S in the decoding path)" can be obtained, where each jump represents a jump of a feature vector frame in the speech signal.

The blank unit represents a non-pronunciation unit and can be used to identify pauses between phonemes and between words in the speech signal. By adding a blank unit for each pronunciation unit, embodiments of the present invention better solve the frame classification problem at the confusable region in the middle of two pronunciation units. Traditional "forced alignment" generally classifies the region in the middle of two pronunciation units as the left label, the right label, or a short pause, which easily leads to inaccurate recognition of that region and to confusion, as shown by the boxed part in FIG. 5, which is a schematic diagram of recognition confusion in the middle of two pronunciation units in a speech recognition method according to an embodiment of the present invention. As can be seen from FIG. 5, with the approach of adding blank units no such confusion occurs, and the accuracy of speech recognition can be improved.

In addition, embodiments of the present invention add blank units dynamically during extension; that is, only when the extension reaches a basic pronunciation unit are the blank-related jump paths added at that unit, and the jump paths of the basic pronunciation unit are merged with the blank-related jump paths. This effectively reduces the number of possible decoding paths and thus accelerates decoding.
S202: obtain the scores of the at least one extended path on the acoustic model and the language model according to the current feature vector frame extracted from the speech signal.

For example, for the possible extended paths obtained above by extending the jump paths of basic pronunciation unit A, the score of each possible extended path may be determined on the acoustic model and the language model according to feature vector frame i+1. The possible extended paths may subsequently be screened according to the scores to obtain the decoding paths corresponding to the speech signal when feature vector frame i+1 is reached (S203).

The score of an extended path is the sum of the acoustic model scores and the language model scores of the basic pronunciation units on the extended path. Specifically, for example, if the extended path jumps from basic pronunciation unit A to basic pronunciation unit B, the acoustic model score of B is obtained from the acoustic model and the language model score of B from the language model, and both are accumulated onto the score of the decoding path before it was extended to B, yielding the score of the extended path. Obtaining the acoustic model score and the language model score of a basic pronunciation unit is the same as in the prior art and is not described in detail here.

S203: screen the at least one extended path according to the scores, and update the current decoding paths according to the screening result.

Compared with the decoding path before the update, the updated current decoding path has one more unit node corresponding to feature vector frame i+1 (which may be a basic pronunciation unit or a blank unit).

In embodiments of the present invention, there may be multiple methods of screening extended paths according to the scores, for example, selecting a preset number of extended paths with higher scores as the new current decoding paths.

In an embodiment of the present invention, the difference between the score of each of the at least one extended path and the highest score among the current decoding paths may also be obtained; if the difference between the score of an extended path and the highest score is smaller than a preset threshold, the extended path is taken as a new current decoding path.

Of course, the present invention is not limited to the methods listed above; screening by other screening rules is also applicable to the present invention.
The acoustic model trained with the CTC technique has a typical spike phenomenon in its scores: when a feature vector frame of the speech signal is located at a certain basic pronunciation unit, the acoustic model score of that basic pronunciation unit for that frame is significantly higher than the scores of the other units, while for feature vector frames not located at any basic pronunciation unit, the score of the blank unit is significantly higher than those of the other units. In other words, if the blank unit has the highest score for a certain feature vector frame, that frame is not located at any basic pronunciation unit.

To reduce the number of possible decoding paths during decoding, path pruning may be performed during path extension. Therefore, in embodiments of the present invention, a pruning strategy may be formulated based on the above spike phenomenon, according to the scores of a basic pronunciation unit on an extended path and of the blank unit corresponding to that basic pronunciation unit.

Specifically, in an embodiment of the present invention, the score of the blank unit and the score of the first basic pronunciation unit may be obtained separately according to the current feature vector frame; if the score of the first basic pronunciation unit is smaller than the score of the blank unit, the preset threshold is lowered when judging whether an extended path entering the first basic pronunciation unit can serve as a new current decoding path. Here, a score is the sum of the language model score and the acoustic model score.

For example, in the above example, after a decoding path reaches basic pronunciation unit A, the score of the current feature vector frame (i.e., feature vector frame i+1) at A and its score at blank may be obtained. If the score of the current frame at A is smaller than its score at blank, there are two possibilities: first, the current frame should be at blank, or second, the current frame is at a unit scoring higher than blank. Therefore, when judging whether an extended path entering basic pronunciation unit A can serve as a new current decoding path, the pruning threshold should be narrowed, i.e., the preset threshold reduced, so that extended paths entering basic pronunciation unit A are pruned more strictly. The number of extended paths can thereby be reduced and the decoding speed increased.

Further, in an embodiment of the present invention, it may also be judged whether the at least one extended path reaches the end of a word; if an extended path reaches a word end, the preset threshold is lowered when judging whether that extended path can serve as a new current decoding path.

During decoding, when a decoding path reaches the end of a word, the actual language model score of the decoding path needs to be queried. Therefore, lowering the preset threshold when judging whether an extended path that reaches a word end can serve as a new current decoding path allows word-end extended paths to be pruned more strictly, reducing the number of extended paths and hence the number of language model score queries, which further increases the decoding speed.
S204: if the current feature vector frame is the last feature vector frame of the speech signal, select the optimal decoding path from the updated current decoding paths according to the scores.

If the current feature vector frame is the last feature vector frame of the speech signal, path extension has been completed, so one optimal decoding path may be selected from all the decoding paths obtained. Specifically, the decoding path with the highest score may be selected from the current decoding paths as the optimal decoding path according to the score of each decoding path.

S103: output the optimal decoding path as the recognition result of the speech signal.

The speech recognition method of the embodiments of the present invention decodes the speech signal based on an acoustic model built with connectionist temporal classification and a decoding network, dynamically adds blank units during decoding to obtain an optimal decoding path after the blank units are added, and outputs it as the recognition result of the speech signal. This solves the problem of confusion in the middle of two pronunciation units, improves the accuracy of speech recognition, effectively reduces the number of possible decoding paths, and increases the decoding speed during recognition.
To implement the above embodiments, the present invention further proposes a speech recognition device.

A speech recognition device includes: a receiving module configured to receive a speech signal; a decoding module configured to decode the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and to dynamically add blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units; and an output module configured to output the optimal decoding path as the recognition result of the speech signal.

FIG. 6 is a first schematic structural diagram of a speech recognition device according to an embodiment of the present invention.

As shown in FIG. 6, the speech recognition device according to an embodiment of the present invention includes a receiving module 10, a decoding module 20, and an output module 30.

Specifically, the receiving module 10 is configured to receive a speech signal.

The decoding module 20 is configured to decode the speech signal according to the pre-established acoustic model, language model, and decoding network, and to dynamically add blank units during decoding to obtain the optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model includes basic pronunciation units and blank units, and the decoding network consists of multiple decoding paths formed from the basic pronunciation units.

In an embodiment of the present invention, the pre-established acoustic model is trained with the CTC (connectionist temporal classification) technique. Specifically, feature extraction may be performed on a large number of speech signals to obtain feature vectors of the speech signals. Blank labels are then added to the feature vectors every predetermined number of pronunciation units, and the speech signals with the added blank labels are trained using connectionist temporal classification to build the acoustic model. The acoustic model includes multiple basic pronunciation units and blank units.

The language model may be any existing language model or any language model that may appear in the future; the present invention does not limit this.

The multiple basic pronunciation units in the acoustic model and the jump relationships (i.e., jump paths) between them can form a large number of decoding paths, and these decoding paths constitute the decoding network. The basic pronunciation unit may be a complete initial or final, which may be called a phoneme.

For example, FIG. 2 is a schematic diagram of a decoding network according to an embodiment of the present invention. As shown in FIG. 2, a dotted circle identifies the start of a decoding path, solid circles (such as A and B) represent basic pronunciation units in the decoding network, and arrows identify the jump paths between basic pronunciation units. As can be seen from FIG. 2, there are multiple decoding paths in the decoding network, and each decoding path is one possible decoding result for the speech signal.

In embodiments of the present invention, decoding the speech signal is the process of selecting the optimal decoding path from the multiple decoding paths in the decoding network according to the feature vector frames of the speech signal.

In an embodiment of the present invention, as shown in FIG. 7, the decoding module 20 may specifically include an extension unit 21, an adding unit 22, a first obtaining unit 23, a screening unit 24, and a selecting unit 25.
The extension unit 21 is configured to extend each current decoding path according to the jump paths in the decoding network.

The process by which the extension unit 21 extends a decoding path is the process of advancing step by step from the start position in the decoding network, along the jump paths between the basic pronunciation units, toward the end position of the decoding network.

For example, if extension of the speech signal up to feature vector frame i has been completed and at least one decoding path (which may be called a current decoding path) has been obtained, and feature vector frame i corresponds to basic pronunciation unit A in one of the current decoding paths, the extension unit 21 may further extend the current decoding path along each jump path of basic pronunciation unit A in the decoding network to obtain possible extended paths. Each forward step in the decoding network represents one possible jump of feature vector frame i in the speech signal to feature vector frame i+1.

The adding unit 22 is configured to dynamically add blank units during the extension to obtain at least one extended path after the blank units are added.

In embodiments of the present invention, as path extension proceeds and the extension reaches a basic pronunciation unit, the adding unit 22 may add a blank unit for that basic pronunciation unit, together with the jump paths related to the blank unit. Specifically, the adding unit 22 may be configured to: determine the first basic pronunciation unit to which each decoding path has currently been extended; and add, for the first basic pronunciation unit, a jump path from the first basic pronunciation unit to the blank unit and a jump path from the blank unit to itself, so as to generate at least one extended path after the blank unit is added for the first basic pronunciation unit.

For example, for the node S in the decoding network of FIG. 4a, the topology after the blank unit is added may be as shown in FIG. 4b: on the basis of the original S->S path (i.e., the jump from S to S), the paths S->blank and blank->blank are added. This is equivalent to adding, on top of the jump paths in the decoding network, the blank-related jump paths for a basic pronunciation unit when the extension reaches that unit, and extending the current decoding path according to the added jump paths.

Thus, after entering S in a decoding path, the possible extended paths "S->S (repeatable, zero or more times), S->blank, blank->blank (repeatable, zero or more times), blank->exit (the basic pronunciation unit following S in the decoding path)" can be obtained, where each jump represents a jump of a feature vector frame in the speech signal.

The blank unit represents a non-pronunciation unit and can be used to identify pauses between phonemes and between words in the speech signal. By adding a blank unit for each pronunciation unit, embodiments of the present invention better solve the frame classification problem at the confusable region in the middle of two pronunciation units. Traditional "forced alignment" generally classifies the region in the middle of two pronunciation units as the left label, the right label, or a short pause, which easily leads to inaccurate recognition of that region and to confusion, as shown by the boxed part in FIG. 5, which is a schematic diagram of recognition confusion in the middle of two pronunciation units in a speech recognition method according to an embodiment of the present invention. As can be seen from FIG. 5, with the approach of adding blank units no such confusion occurs, and the accuracy of speech recognition can be improved.

In addition, embodiments of the present invention add blank units dynamically during extension; that is, only when the extension reaches a basic pronunciation unit are the blank-related jump paths added at that unit, and the jump paths of the basic pronunciation unit are merged with the blank-related jump paths. This effectively reduces the number of possible decoding paths and thus accelerates decoding.
The first obtaining unit 23 is configured to obtain the scores of the at least one extended path on the acoustic model and the language model according to the current feature vector frame extracted from the speech signal.

For example, for the possible extended paths obtained above by extending the jump paths of basic pronunciation unit A, the first obtaining unit 23 may determine the score of each possible extended path on the acoustic model and the language model according to feature vector frame i+1. The screening unit 24 may subsequently screen the possible extended paths according to the scores to obtain the decoding paths corresponding to the speech signal when feature vector frame i+1 is reached.

The score of an extended path is the sum of the acoustic model scores and the language model scores of the basic pronunciation units on the extended path. Specifically, for example, if the extended path jumps from basic pronunciation unit A to basic pronunciation unit B, the first obtaining unit 23 may obtain the acoustic model score of B from the acoustic model and the language model score of B from the language model, and accumulate both onto the score of the decoding path before it was extended to B, yielding the score of the extended path. Obtaining the acoustic model score and the language model score of a basic pronunciation unit is the same as in the prior art and is not described in detail here.

The screening unit 24 is configured to screen the at least one extended path according to the scores and update the current decoding paths according to the screening result.

Compared with the decoding path before the update, the updated current decoding path has one more unit node corresponding to feature vector frame i+1 (which may be a basic pronunciation unit or a blank unit).

In embodiments of the present invention, the screening unit 24 may screen extended paths according to the scores in multiple ways; for example, the screening unit 24 may select a preset number of extended paths with higher scores as the new current decoding paths.

In an embodiment of the present invention, the screening unit 24 may also be configured to: obtain the difference between the score of each of the at least one extended path and the highest score among the current decoding paths; and, if the difference between the score of an extended path and the highest score is smaller than a preset threshold, take the extended path as a new current decoding path.

Of course, the present invention is not limited to the methods listed above; screening by other screening rules is also applicable to the present invention.

The selecting unit 25 is configured to select, if the current feature vector frame is the last feature vector frame of the speech signal, the optimal decoding path from the updated current decoding paths according to the scores.

If the current feature vector frame is the last feature vector frame of the speech signal, path extension has been completed, so the selecting unit 25 may select one optimal decoding path from all the decoding paths obtained. Specifically, the selecting unit 25 may select the decoding path with the highest score from the current decoding paths as the optimal decoding path according to the score of each decoding path.

The output module 30 is configured to output the optimal decoding path as the recognition result of the speech signal.

The speech recognition device of the embodiments of the present invention decodes the speech signal based on an acoustic model built with connectionist temporal classification and a decoding network, dynamically adds blank units during decoding to obtain an optimal decoding path after the blank units are added, and outputs it as the recognition result of the speech signal. This solves the problem of confusion in the middle of two pronunciation units, improves the accuracy of speech recognition, effectively reduces the number of possible decoding paths, and increases the decoding speed during recognition.
The acoustic model trained with the CTC technique has a typical spike phenomenon in its scores: when a feature vector frame of the speech signal is located at a certain basic pronunciation unit, the acoustic model score of that basic pronunciation unit for that frame is significantly higher than the scores of the other units, while for feature vector frames not located at any basic pronunciation unit, the score of the blank unit is significantly higher than those of the other units. In other words, if the blank unit has the highest score for a certain feature vector frame, that frame is not located at any basic pronunciation unit.

To reduce the number of possible decoding paths during decoding, path pruning may be performed during path extension. Therefore, in embodiments of the present invention, a pruning strategy may be formulated based on the above spike phenomenon, according to the scores of a basic pronunciation unit on an extended path and of the blank unit corresponding to that basic pronunciation unit.

Exemplary descriptions are given below with reference to FIG. 8 and FIG. 9.

FIG. 8 is a third schematic structural diagram of a speech recognition device according to an embodiment of the present invention.

As shown in FIG. 8, in the speech recognition device of this embodiment of the present invention, on the basis of FIG. 7, the decoding module 20 may further include a second obtaining unit 26 and a first control unit 27.

The second obtaining unit 26 is configured to obtain the score of the blank unit and the score of the first basic pronunciation unit according to the current feature vector frame.

The first control unit 27 is configured to, when the score of the blank unit is smaller than the score of the first basic pronunciation unit, lower the preset threshold when judging whether an extended path entering the first basic pronunciation unit can serve as a new current decoding path. Here, a score is the sum of the language model score and the acoustic model score.

For example, in the above example, after a decoding path reaches basic pronunciation unit A, the score of the current feature vector frame (i.e., feature vector frame i+1) at A and its score at blank may be obtained. If the score of the current frame at A is smaller than its score at blank, there are two possibilities: first, the current frame should be at blank, or second, the current frame is at a unit scoring higher than blank. Therefore, when judging whether an extended path entering basic pronunciation unit A can serve as a new current decoding path, the pruning threshold should be narrowed, i.e., the preset threshold reduced, so that extended paths entering basic pronunciation unit A are pruned more strictly. The number of extended paths can thereby be reduced and the decoding speed increased.

FIG. 9 is a fourth schematic structural diagram of a speech recognition device according to an embodiment of the present invention.

As shown in FIG. 9, in the speech recognition device of this embodiment of the present invention, on the basis of FIG. 7, the decoding module 20 may further include a judging unit 28 and a second control unit 29.

The judging unit 28 is configured to judge whether the at least one extended path reaches the end of a word.

The second control unit 29 is configured to, when an extended path reaches the end of a word, lower the preset threshold when judging whether the extended path reaching the word end can serve as a new current decoding path.

During decoding, when a decoding path reaches the end of a word, the actual language model score of the decoding path needs to be queried. Therefore, lowering the preset threshold when judging whether an extended path that reaches a word end can serve as a new current decoding path allows word-end extended paths to be pruned more strictly, reducing the number of extended paths and hence the number of language model score queries, which further increases the decoding speed.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where no contradiction arises, those skilled in the art may combine different embodiments or examples described in this specification and the features of different embodiments or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means two or more, unless specifically and explicitly defined otherwise.

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code including one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, which should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any apparatus that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.

It should be understood that parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques well known in the art: discrete logic circuits with logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gate circuits, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.

Those of ordinary skill in the art can understand that all or some of the steps carried by the methods of the above embodiments may be completed by instructing relevant hardware through a program; the program may be stored in a computer-readable storage medium and, when executed, includes one of or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention, and those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (14)

  1. A speech recognition method, comprising the following steps:
    receiving a speech signal;
    decoding the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and dynamically adding blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model comprises basic pronunciation units and the blank unit, and the decoding network consists of a plurality of decoding paths formed from the basic pronunciation units;
    outputting the optimal decoding path as a recognition result of the speech signal.
  2. The method according to claim 1, wherein decoding the speech signal according to the pre-established acoustic model, language model, and decoding network and dynamically adding blank units during decoding to obtain the optimal decoding path after the blank units are added comprises:
    extending each current decoding path according to jump paths in the decoding network, and dynamically adding blank units during the extension to obtain at least one extended path after the blank units are added;
    obtaining scores of the at least one extended path on the acoustic model and the language model according to a current feature vector frame extracted from the speech signal;
    screening the at least one extended path according to the scores, and updating the current decoding paths according to the screening result;
    if the current feature vector frame is the last feature vector frame of the speech signal, selecting the optimal decoding path from the updated current decoding paths according to the scores.
  3. The method according to claim 2, wherein screening the at least one extended path according to the scores and updating the current decoding paths according to the screening result comprises:
    obtaining a difference between the score of each of the at least one extended path and a highest score among the current decoding paths;
    if the difference between the score of an extended path and the highest score is smaller than a preset threshold, taking the extended path as a new current decoding path.
  4. The method according to claim 3, wherein dynamically adding blank units during the extension comprises:
    determining a first basic pronunciation unit to which each current decoding path has been extended;
    adding, for the first basic pronunciation unit, a jump path from the first basic pronunciation unit to the blank unit and a jump path from the blank unit to itself, so as to generate at least one extended path after the blank unit is added for the first basic pronunciation unit.
  5. The method according to claim 4, further comprising:
    obtaining a score of the blank unit and a score of the first basic pronunciation unit according to the current feature vector frame;
    if the score of the blank unit is smaller than the score of the first basic pronunciation unit, lowering the preset threshold when judging whether an extended path entering the first basic pronunciation unit can serve as a new current decoding path.
  6. The method according to claim 4, further comprising:
    judging whether the at least one extended path reaches an end of a word;
    if an extended path reaches the end of a word, lowering the preset threshold when judging whether the extended path reaching the end of the word can serve as a new current decoding path.
  7. A speech recognition device, comprising:
    a receiving module configured to receive a speech signal;
    a decoding module configured to decode the speech signal according to a pre-established acoustic model, a language model, and a decoding network, and to dynamically add blank units during decoding to obtain an optimal decoding path after the blank units are added, wherein the acoustic model is trained by connectionist temporal classification, the acoustic model comprises basic pronunciation units and the blank unit, and the decoding network consists of a plurality of decoding paths formed from the basic pronunciation units;
    an output module configured to output the optimal decoding path as a recognition result of the speech signal.
  8. The device according to claim 7, wherein the decoding module comprises:
    an extension unit configured to extend each current decoding path according to jump paths in the decoding network;
    an adding unit configured to dynamically add blank units during the extension to obtain at least one extended path after the blank units are added;
    a first obtaining unit configured to obtain scores of the at least one extended path on the acoustic model and the language model according to a current feature vector frame extracted from the speech signal;
    a screening unit configured to screen the at least one extended path according to the scores and update the current decoding paths according to the screening result;
    a selecting unit configured to select, if the current feature vector frame is the last feature vector frame of the speech signal, the optimal decoding path from the updated current decoding paths according to the scores.
  9. The device according to claim 8, wherein the screening unit is configured to:
    obtain a difference between the score of each of the at least one extended path and a highest score among the current decoding paths;
    if the difference between the score of an extended path and the highest score is smaller than a preset threshold, take the extended path as a new current decoding path.
  10. The device according to claim 9, wherein the adding unit is configured to:
    determine a first basic pronunciation unit to which each current decoding path has been extended;
    add, for the first basic pronunciation unit, a jump path from the first basic pronunciation unit to the blank unit and a jump path from the blank unit to itself, so as to generate at least one extended path after the blank unit is added for the first basic pronunciation unit.
  11. The device according to claim 10, wherein the decoding module further comprises:
    a second obtaining unit configured to obtain a score of the blank unit and a score of the first basic pronunciation unit according to the current feature vector frame;
    a first control unit configured to, when the score of the blank unit is smaller than the score of the first basic pronunciation unit, lower the preset threshold when judging whether an extended path entering the first basic pronunciation unit can serve as a new current decoding path.
  12. The device according to claim 10, wherein the decoding module further comprises:
    a judging unit configured to judge whether the at least one extended path reaches an end of a word;
    a second control unit configured to, when an extended path reaches the end of a word, lower the preset threshold when judging whether the extended path reaching the end of the word can serve as a new current decoding path.
  13. An electronic device, comprising:
    one or more processors;
    a memory;
    one or more programs stored in the memory which, when executed by the one or more processors, perform the speech recognition method according to any one of claims 1-6.
  14. A non-volatile computer storage medium storing one or more programs which, when executed by a device, cause the device to perform the speech recognition method according to any one of claims 1-6.
PCT/CN2016/091765 2015-12-14 2016-07-26 Speech recognition method and device WO2017101450A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/758,159 US10650809B2 (en) 2015-12-14 2016-07-26 Speech recognition method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510925644.6A CN105529027B (zh) 2015-12-14 Speech recognition method and device
CN201510925644.6 2015-12-14

Publications (1)

Publication Number Publication Date
WO2017101450A1 true WO2017101450A1 (zh) 2017-06-22

Family

ID=55771204

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/091765 WO2017101450A1 (zh) 2015-12-14 2016-07-26 Speech recognition method and device

Country Status (3)

Country Link
US (1) US10650809B2 (zh)
CN (1) CN105529027B (zh)
WO (1) WO2017101450A1 (zh)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105529027B (zh) 2015-12-14 2019-05-31 百度在线网络技术(北京)有限公司 Speech recognition method and device
JP6618884B2 (ja) * 2016-11-17 2019-12-11 株式会社東芝 Recognition device, recognition method, and program
CN106531158A (zh) * 2016-11-30 2017-03-22 北京理工大学 Response speech recognition method and device
CN108231089B (zh) * 2016-12-09 2020-11-03 百度在线网络技术(北京)有限公司 Artificial-intelligence-based speech processing method and device
CN108269568B (zh) * 2017-01-03 2021-07-30 中国科学院声学研究所 CTC-based acoustic model training method
JP6599914B2 (ja) * 2017-03-09 2019-10-30 株式会社東芝 Speech recognition device, speech recognition method, and program
CN107680587A (zh) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Acoustic model training method and device
CN111081226B (zh) * 2018-10-18 2024-02-13 北京搜狗科技发展有限公司 Speech recognition decoding optimization method and device
US11056098B1 (en) 2018-11-28 2021-07-06 Amazon Technologies, Inc. Silent phonemes for tracking end of speech
CN111477212B (zh) * 2019-01-04 2023-10-24 阿里巴巴集团控股有限公司 Content recognition, model training, and data processing method, system, and device
CN111429889B (zh) * 2019-01-08 2023-04-28 百度在线网络技术(北京)有限公司 Method, apparatus, device, and computer-readable storage medium for real-time speech recognition based on truncated attention
CN110428819B (zh) * 2019-05-21 2020-11-24 腾讯科技(深圳)有限公司 Decoding network generation method, speech recognition method, apparatus, device, and medium
CN110322884B (zh) * 2019-07-09 2021-12-07 科大讯飞股份有限公司 Word insertion method, apparatus, device, and storage medium for a decoding network
CN110444203B (zh) * 2019-07-17 2024-02-27 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, and electronic device
CN110930979B (zh) * 2019-11-29 2020-10-30 百度在线网络技术(北京)有限公司 Speech recognition model training method and apparatus, and electronic device
US11475167B2 (en) 2020-01-29 2022-10-18 International Business Machines Corporation Reserving one or more security modules for a secure guest
WO2022198474A1 (en) 2021-03-24 2022-09-29 Sas Institute Inc. Speech-to-analytics framework with support for large n-gram corpora
US11138979B1 (en) 2020-03-18 2021-10-05 Sas Institute Inc. Speech audio pre-processing segmentation
CN113539242A (zh) * 2020-12-23 2021-10-22 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, computer device, and storage medium
CN113808594A (zh) * 2021-02-09 2021-12-17 京东科技控股股份有限公司 Encoding node processing method and apparatus, computer device, and storage medium
CN114067800B (zh) * 2021-04-28 2023-07-18 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and electronic device
CN113707137B (zh) * 2021-08-30 2024-02-20 普强时代(珠海横琴)信息技术有限公司 Decoding implementation method and apparatus
KR102626954B1 (ko) * 2023-04-20 2024-01-18 주식회사 덴컴 Speech recognition device for dentistry and method using the same

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1335978A (zh) * 1999-01-06 2002-02-13 D.S.P.C.科技有限公司 Noise-robust speech recognition system and method
CN103151039A (zh) * 2013-02-07 2013-06-12 中国科学院自动化研究所 Speaker age-group recognition method based on support vector machine (SVM)
CN105139864A (zh) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Speech recognition method and device
CN105529027A (zh) * 2015-12-14 2016-04-27 百度在线网络技术(北京)有限公司 Speech recognition method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103794211B (zh) * 2012-11-02 2017-03-01 北京百度网讯科技有限公司 Speech recognition method and system
CN103065633B (zh) * 2012-12-27 2015-01-14 安徽科大讯飞信息科技股份有限公司 Method for optimizing speech recognition decoding efficiency
CN103065630B (zh) * 2012-12-28 2015-01-07 科大讯飞股份有限公司 Speech recognition method and system for user personalized information


Also Published As

Publication number Publication date
US10650809B2 (en) 2020-05-12
CN105529027A (zh) 2016-04-27
CN105529027B (zh) 2019-05-31
US20180254039A1 (en) 2018-09-06

Similar Documents

Publication Publication Date Title
WO2017101450A1 (zh) Speech recognition method and device
CN107195295B (zh) Speech recognition method and device based on a mixed Chinese-English lexicon
US10741170B2 (en) Speech recognition method and apparatus
JP6676141B2 (ja) Speech segment detection method and apparatus
CN107301860B (zh) Speech recognition method and device based on a mixed Chinese-English lexicon
CN108711421B (zh) Method and device for building a speech recognition acoustic model, and electronic device
KR102167719B1 (ko) Method and apparatus for training a language model, and method and apparatus for speech recognition
KR102399535B1 (ko) Learning method and apparatus for speech recognition
US9600764B1 (en) Markov-based sequence tagging using neural networks
US20160260426A1 (en) Speech recognition apparatus and method
EP4018437B1 (en) Optimizing a keyword spotting system
CN105336322A (zh) Polyphone model training method, and speech synthesis method and device
JP5310563B2 (ja) Speech recognition system, speech recognition method, and speech recognition program
JP7008096B2 (ja) Sentence recommendation method and apparatus based on associated points of interest
CN106843523B (zh) Artificial-intelligence-based text input method and device
JP2018159917A (ja) Method and apparatus for training an acoustic model
KR102167157B1 (ko) Speech recognition method applying pronunciation variation
JP2002215187A (ja) Speech recognition method and apparatus therefor
WO2014176489A2 (en) A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
JP6276513B2 (ja) Speech recognition device and speech recognition program
CN112259084A (zh) Speech recognition method and apparatus, and storage medium
JP2012018403A (ja) Pattern recognition method and apparatus, pattern recognition program, and recording medium therefor
JP6026224B2 (ja) Pattern recognition method and apparatus, pattern recognition program, and recording medium therefor
JP3813491B2 (ja) Continuous speech recognition device and program therefor
KR101578766B1 (ko) Apparatus and method for generating a search space for speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16874511

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15758159

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16874511

Country of ref document: EP

Kind code of ref document: A1