WO2019179285A1 - Speech recognition method, apparatus, device, and storage medium - Google Patents

Speech recognition method, apparatus, device, and storage medium

Info

Publication number
WO2019179285A1
Authority
WO
WIPO (PCT)
Prior art keywords
segment
frame
voice
candidate
speech
Prior art date
Application number
PCT/CN2019/076223
Other languages
English (en)
French (fr)
Inventor
林诗伦
张玺霖
麻文华
刘博
李新辉
卢鲤
江修才
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2020542123A priority Critical patent/JP6980119B2/ja
Priority to EP19770634.4A priority patent/EP3770905A4/en
Publication of WO2019179285A1 publication Critical patent/WO2019179285A1/zh
Priority to US16/900,824 priority patent/US11450312B2/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • G10L15/05Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/088Word spotting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • The present application relates to the field of voice recognition, and in particular to a voice recognition method, apparatus, device, and storage medium.
  • Voice wake-up, also known as keyword spotting (KWS), is a function by which an electronic device in a sleep or lock-screen state recognizes the user's voice and, upon determining that the user's voice contains a preset keyword, releases the sleep and/or lock-screen state and then starts voice interaction.
  • The embodiments of the present application provide a voice recognition method, apparatus, and device.
  • The technical solution is as follows:
  • An example of the present application provides a voice recognition method, executed by a terminal or a server, the method including: acquiring voice information; determining, by using a weighted finite state machine network, a candidate voice segment in the voice information and the start and stop positions of the candidate voice segment; intercepting the candidate voice segment from the voice information according to the start and stop positions; and inputting the candidate voice segment into a machine learning model, and detecting, by the machine learning model, whether the candidate voice segment includes a preset keyword.
  • If the candidate voice segment includes the preset keyword, it is determined that the voice information includes the preset keyword.
  • the application example also provides a voice wake-up method, including:
  • the terminal sends the obtained voice information to the server;
  • the server detects whether the voice information includes a preset keyword;
  • if the voice information includes the preset keyword, the server intercepts a candidate voice segment from the voice information;
  • the candidate voice segment is a voice information segment corresponding to the preset keyword;
  • the server verifies the candidate voice segment and detects again whether the candidate voice segment includes the preset keyword;
  • if the candidate voice segment includes the preset keyword, the server sends a wake-up instruction to the terminal;
  • the terminal releases the sleep state and/or the lock screen state of the local machine according to the wake-up instruction.
  • the application example further provides a voice recognition device, the device comprising:
  • An acquisition module configured to obtain voice information
  • a processing module configured to determine, by using a weighted finite state machine network, the start and stop positions of a candidate voice segment in the voice information; intercept the candidate voice segment from the voice information according to the start and stop positions; input the candidate voice segment into a machine learning model, and detect, by the machine learning model, whether the candidate voice segment includes the preset keyword; and, if the candidate voice segment includes the preset keyword, determine that the voice information includes the preset keyword.
  • the present application also provides a speech recognition apparatus comprising a processor and a memory, the memory storing at least one instruction loaded by the processor and executed to implement a speech recognition method as described above.
  • the present application example also provides a computer readable storage medium having stored therein at least one instruction loaded by a processor and executed to implement a speech recognition method as described above.
  • FIG. 1A is an implementation environment diagram of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 1B is an implementation environment diagram of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 2 is a method flowchart of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 3 is a method flowchart of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of voice information framing provided by an exemplary embodiment of the present application.
  • FIG. 5 is a block diagram of a weighted finite state machine network provided by an exemplary embodiment of the present application.
  • FIG. 6 is a method flowchart of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 7A is a block diagram of a convolutional neural network provided by an exemplary embodiment of the present application.
  • FIG. 7B is an overall architectural diagram of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 8 is a method flowchart of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 9 is an application scenario diagram of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 10 is an application scenario diagram of a voice recognition method provided by an exemplary embodiment of the present application.
  • FIG. 11 is a structural block diagram of a voice recognition apparatus according to an exemplary embodiment of the present application.
  • FIG. 12 is a structural block diagram of a voice recognition device provided by an exemplary embodiment of the present application.
  • Machine learning model: a computational model consisting of a large number of interconnected nodes (or neurons). Each node corresponds to a policy function, and the connection between every two nodes carries a weighted value, called a weight, for the signal passing through that connection. After a sample is input into a node of the machine learning model, each node produces an output result that serves as an input sample of the next node; the machine learning model adjusts the policy function and the weight of each node according to the final output result of the sample. This process is called training.
  • Weighted finite state machine network is a mathematical model that represents a finite number of states and behaviors such as transitions and actions between these states.
  • the weighted finite state machine network includes an acoustic model, a dictionary, and a language model.
  • Acoustic model A mathematical model that outputs a hidden state corresponding to the maximum posterior probability according to the speech information.
  • the hidden state may be a phoneme or a speech unit smaller than a phoneme.
  • The acoustic model in the embodiments of the present application is a hidden Markov model-deep neural network (HMM-DNN) model.
  • Phoneme: the smallest phonetic unit divided according to the natural attributes of speech. From an acoustic perspective, a phoneme is the smallest unit of speech divided by sound quality; from a physiological perspective, one articulatory action forms one phoneme.
  • Hidden Markov Model (HMM): a statistical analysis model used to describe a Markov process containing hidden unknown parameters. In a hidden Markov model, the states are not directly visible, but certain variables affected by the states are visible.
  • Multilayer Perceptron A feedforward neural network that nonlinearly maps a set of input vectors to a set of output vectors. Multilayer perceptrons can be trained using backpropagation algorithms.
  • Deep Neural Network A machine learning model that is a multi-layer perceptron with more than two hidden layers.
  • Except for the input nodes, each node is a neuron with a nonlinear activation function.
  • deep neural networks can be trained using backpropagation algorithms.
  • Convolutional Neural Network (CNN): a machine learning model composed of at least two cascaded convolutional layers, a fully connected layer (FC) at the top, and a soft maximization function (Softmax).
  • Each convolutional layer is followed by a pooling layer.
  • The soft maximization function, also called the normalized exponential function or Softmax function, "compresses" a K-dimensional vector z of arbitrary real values into another K-dimensional real vector σ(z) such that each element lies in the range (0, 1) and all elements sum to 1.
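  • As a concrete reference for the soft maximization function described above, the following is a minimal sketch in Python/NumPy; the function name and the numerically stabilizing max-subtraction are choices made for this example and are not part of the original disclosure.

        import numpy as np

        def softmax(z: np.ndarray) -> np.ndarray:
            """Compress a K-dimensional real vector z into sigma(z): every element
            falls in (0, 1) and all elements sum to 1."""
            z = z - np.max(z)              # subtract the max for numerical stability
            exp_z = np.exp(z)
            return exp_z / np.sum(exp_z)

        # Example: three class scores become a probability distribution over classes.
        print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.659, 0.242, 0.099]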
  • In the related art, the voice recognition method includes: extracting features of the voice information, converting the voice information into corresponding text information through a weighted finite state transducer (WFST) network, and detecting whether the text information includes a preset keyword.
  • In this case, voice information without semantics but similar to the preset keyword, such as noise or background music, may be recognized as semantic voice information, thereby erroneously waking up the electronic device and resulting in low recognition accuracy.
  • Referring to FIG. 1A and FIG. 1B, implementation environment diagrams of a speech recognition method provided by an exemplary embodiment of the present application are shown.
  • FIG. 1A is a first possible implementation environment provided by an embodiment of the present application, where the implementation environment includes: a terminal 110 and a server 120.
  • the terminal 110 establishes a connection with the server 120 through a wired or wireless network.
  • the voice information is acquired by the terminal 110, and the voice information is recognized by the server 120 and the terminal 110 is instructed to release the sleep state and/or the lock screen state.
  • The silence detecting unit of the terminal 110 determines whether there is a user voice in a silent environment; if it is determined that there is a user voice, the recording unit is activated to record the user voice and obtain the corresponding original voice signal, and the original voice signal is sent to the server 120 through a wired or wireless network.
  • The server 120 performs preliminary extraction on the original voice signal to obtain voice information and detects whether the voice information includes a preset keyword. If the voice information includes the preset keyword, a candidate voice segment is intercepted from the voice information, the candidate voice segment being a voice information segment corresponding to the preset keyword; the server then performs a secondary check on the candidate voice segment to detect whether the candidate voice segment includes the preset keyword; if the candidate voice segment includes the preset keyword, a wake-up command is sent to the terminal 110.
  • After receiving the wake-up command sent by the server 120, the terminal 110 cancels the sleep state and/or the lock screen state of the local machine according to the wake-up command.
  • FIG. 1B is a second possible implementation environment provided by an embodiment of the present application, where the implementation environment includes: a terminal 110, a terminal 130, and a server 120.
  • the terminal 110 establishes a connection with the server 120 through a wired or wireless network
  • the terminal 130 establishes a connection with the server 120 through a wired or wireless network.
  • the voice information is acquired by the terminal 110, and the voice information is identified by the server 120 and the terminal 130 is instructed to release the sleep state and/or the lock screen state.
  • The silence detecting unit of the terminal 110 determines whether there is a user voice in a silent environment; if it is determined that there is a user voice, the recording unit is activated to record the user voice and obtain the corresponding original voice signal, and the original voice signal is sent to the server 120 through a wired or wireless network.
  • The server 120 performs preliminary extraction on the original voice signal to obtain voice information and detects whether the voice information includes a preset keyword. If the voice information includes the preset keyword, a candidate voice segment is intercepted from the voice information, the candidate voice segment being a voice information segment corresponding to the preset keyword; the server then performs a secondary check on the candidate voice segment to detect whether the candidate voice segment includes the preset keyword; if the candidate voice segment includes the preset keyword, a wake-up command is sent to the terminal 130.
  • After receiving the wake-up command sent by the server 120, the terminal 130 cancels the sleep state and/or the lock screen state of the device according to the wake-up command.
  • the voice information is acquired by the terminal 110, the voice information is identified, and the sleep state and/or the lock screen state of the local machine is released.
  • The silence detecting unit of the terminal 110 determines whether there is a user voice in a silent environment; if it is determined that there is a user voice, the recording unit is activated to record the user voice and obtain the original voice signal; preliminary extraction is performed on the original voice signal to obtain voice information; it is detected whether the voice information includes a preset keyword; if the voice information includes the preset keyword, a candidate voice segment is intercepted from the voice information, the candidate voice segment being a voice information segment corresponding to the preset keyword; a secondary check is performed on the candidate voice segment to detect whether it includes the preset keyword; and if the candidate voice segment includes the preset keyword, the sleep state and/or the lock screen state of the local machine is cancelled.
  • The terminal may be an electronic device including a silence detecting unit and a recording unit, such as a mobile phone, a tablet computer, an e-book reader, a laptop computer, a desktop computer, a smart speaker, an intelligent robot, an in-vehicle control center, or the like.
  • FIG. 2 shows a method flowchart of a voice recognition method provided by an exemplary embodiment of the present application.
  • The method can be applied to the server 120 shown in FIG. 1A and FIG. 1B, and can also be applied to the terminal.
  • the method includes:
  • Step 201 Acquire voice information.
  • the server receives the original voice signal sent by the terminal, and performs preliminary extraction of the original voice signal to obtain voice information.
  • The terminal records the user voice to obtain the original voice signal and sends the original voice signal to the server through a wired or wireless network, and the server receives the original voice signal.
  • Step 202 Determine a starting and ending position of the candidate voice segment in the voice information.
  • In this step, the server obtains the language information of the maximum posterior probability corresponding to the voice information by using the weighted finite state machine network; if the language information includes the preset keyword, the server determines the start and end positions of the candidate voice segment corresponding to the preset keyword in the voice information.
  • If the voice information is a time domain function, the start and end positions are the time at which the candidate voice segment starts and the time at which it ends in the voice information; if the voice information is a frequency domain function, the start and end positions are the frequency at which the candidate voice segment starts and the frequency at which it ends.
  • the candidate speech segment includes at least one frame of speech segments.
  • For example, assume the preset keyword is "ON". The language information of the maximum posterior probability obtained by the server through the weighted finite state machine network includes "ON", where the first part of the keyword corresponds to voice segment 1 and the second part corresponds to voice segment 2.
  • The start time of voice segment 1 is t1 and its end time is t2; the start time of voice segment 2 is t3 and its end time is t4. If t1 is before t3 and t4 is after t2, the candidate voice segment is the segment of the voice information whose start time is t1 and whose end time is t4; that is, the start and end positions of the candidate voice segment in the voice information are determined to be t1 to t4.
  • Step 203 The candidate speech segment is intercepted in the voice information according to the start and end position of the candidate speech segment.
  • the server intercepts the candidate speech segments from the speech information according to the starting and ending positions of the candidate speech segments in the speech information.
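  • As an illustration of this interception step, the following Python sketch cuts a candidate segment out of a time-domain signal given its start and end times (for example the t1 to t4 span located above); the 16 kHz sample rate and the helper name are assumptions made for this example only.

        import numpy as np

        def intercept_segment(signal: np.ndarray, start_s: float, end_s: float,
                              sample_rate: int = 16000) -> np.ndarray:
            """Cut the candidate voice segment [start_s, end_s] (in seconds)
            out of the time-domain voice information."""
            start = int(round(start_s * sample_rate))
            end = int(round(end_s * sample_rate))
            return signal[start:end]

        # Usage: candidate = intercept_segment(voice_signal, t1, t4)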
  • Step 204 Enter the candidate speech segment into the machine learning model, and use the machine learning model to detect whether the candidate speech segment contains the preset keyword.
  • The machine learning model includes a convolutional neural network or a weighted finite state machine network. After coarsely locating the candidate speech segment through the weighted finite state machine network, the server may verify the candidate speech segment with the convolutional neural network, or verify it again with the weighted finite state machine network. Exemplarily, the server convolves the candidate speech segment with the first convolutional layer of the convolutional neural network to obtain a first high-level semantic feature, and inputs the first high-level semantic feature into the first pooling layer to obtain a first compressed high-level semantic feature.
  • The first compressed high-level semantic feature is input into the second convolutional layer to obtain a second high-level semantic feature, and the second high-level semantic feature is input into the second pooling layer to obtain a second compressed high-level semantic feature. After repeated convolution and pooling in this way, the high-level semantic features of the candidate speech segment are extracted.
  • the server obtains the language information of the maximum posterior probability corresponding to the candidate speech segment by using the weighted finite state machine network, and detects whether the language information includes the preset keyword.
  • Step 205 If the candidate speech segment includes a preset keyword, determine that the voice information includes a preset keyword.
  • the server determines that the voice information includes the preset keyword.
  • Detection of the candidate speech segment by the weighted finite state machine network takes a relatively long time, and its accuracy is lower than that of the convolutional neural network.
  • In summary, the candidate speech segment coarsely located by the weighted finite state machine network is verified by the machine learning model to determine whether the candidate speech segment includes the preset keyword. This solves the problem in the related art that voice information without semantics may be recognized as semantic voice information and cause false wake-ups, and improves the accuracy of speech recognition.
  • FIG. 3 shows a flowchart of a method for a speech recognition method provided by an exemplary embodiment of the present application.
  • The method can be applied to the server 120 shown in FIG. 1A and FIG. 1B, and can also be applied to the terminal.
  • the method can be an implementation of step 202 in the embodiment of FIG. 2, the method includes:
  • step 202a the voice information is framed to obtain a multi-frame voice segment.
  • The server frames the voice information through a moving window to obtain multi-frame voice segments.
  • the moving window has a preset window length and a step length, and each frame of the voice segment has a corresponding starting and ending position and a serial number index.
  • The window length and the step length are in units of a preset time length. As shown in FIG. 4, the window length of the moving window 400 is 20 milliseconds and the step length is 10 milliseconds; using the moving window 400, the voice information is divided into frames of 20 milliseconds each, with an overlap of 10 milliseconds between adjacent frames.
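  • For reference, a minimal Python sketch of the framing described above, using a 20 ms window moved in 10 ms steps as in FIG. 4; the 16 kHz sample rate and the function name are assumptions made for this example.

        import numpy as np

        def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                         window_ms: int = 20, step_ms: int = 10):
            """Split a 1-D speech signal into overlapping frames with a moving
            window, keeping each frame's serial-number index and start/stop times."""
            win = int(sample_rate * window_ms / 1000)     # samples per window
            step = int(sample_rate * step_ms / 1000)      # samples per step
            frames = []
            for idx, start in enumerate(range(0, len(signal) - win + 1, step)):
                frames.append({
                    "index": idx,                          # serial number index
                    "start_s": start / sample_rate,        # start position (seconds)
                    "end_s": (start + win) / sample_rate,  # end position (seconds)
                    "samples": signal[start:start + win],
                })
            return frames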
  • Step 202b Input the multi-frame speech segment into the weighted finite state machine network to obtain the language information of the maximum a posteriori probability corresponding to the multi-frame speech segment.
  • the weighted finite state machine network includes an acoustic model, a dictionary, and a language model.
  • the acoustic model can be composed of a deep neural network and a hidden Markov model.
  • The deep neural network includes at least two cascaded neural network layers and a fully connected layer, and is a mathematical model that outputs the posterior probability of the hidden state corresponding to a speech segment according to the input speech segment.
  • In FIG. 5, V represents the speech segment input into the deep neural network;
  • W represents the parameters of each neural network layer in the deep neural network, where W1 represents the parameters of the first neural network layer and WM represents the parameters of the Mth neural network layer;
  • h(i) represents the output of the i-th neural network layer; for example, h(1) represents the output result of the first neural network layer and h(M) represents the output result of the Mth neural network layer;
  • Si represents the i-th hidden state, for example the first hidden state S1, ..., the Kth hidden state SK;
  • aSiSj represents the transition probability between the i-th hidden state Si and the j-th hidden state Sj; for example, aS1S2 represents the transition probability between the first hidden state S1 and the second hidden state S2.
  • the hidden Markov model is a mathematical model for outputting the hidden state corresponding to the speech segment based on the posterior probability of the hidden state corresponding to the speech segment.
  • A dictionary is a correspondence between phonemes and words; inputting at least one phoneme into the dictionary yields the character or word having the maximum posterior probability corresponding to the at least one phoneme.
  • a language model is a correspondence between words and syntax and/or grammar.
  • The character or word is input into the language model to obtain the language information of the maximum posterior probability, where the language information may be a word or a sentence.
  • The server inputs the multi-frame speech segments into the deep neural network for feature extraction and obtains the posterior probability of the hidden state corresponding to each frame of speech segment. According to these posterior probabilities, the hidden Markov model yields the hidden state corresponding to each frame of speech segment; the phonemes corresponding to the multi-frame speech segments are obtained from these hidden states; the character or word with the maximum posterior probability corresponding to the multi-frame speech segments is obtained through the dictionary; and, according to that character or word, the language information of the maximum posterior probability corresponding to the multi-frame speech segments is obtained through the language model.
  • In other words, the output obtained by inputting the multi-frame speech segments into the weighted finite state machine network is the language information of the maximum posterior probability corresponding to the multi-frame speech segments.
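  • The decoding pipeline described in this step can be summarized by the following schematic Python sketch; the component objects (dnn, hmm, lexicon, lang_model) and their method names are placeholders invented for illustration and are not an actual API from the original disclosure.

        def wfst_decode(frames, dnn, hmm, lexicon, lang_model):
            # 1. Acoustic model, DNN part: posterior probability of hidden states per frame.
            posteriors = [dnn.forward(frame) for frame in frames]

            # 2. Acoustic model, HMM part: most likely hidden state per frame.
            hidden_states = hmm.decode(posteriors)

            # 3. Hidden states -> phonemes (each phoneme spans at least one hidden state).
            phonemes = hmm.states_to_phonemes(hidden_states)

            # 4. Dictionary: phonemes -> characters/words of maximum posterior probability.
            words = lexicon.lookup(phonemes)

            # 5. Language model: words -> language information of maximum posterior probability.
            return lang_model.best_sequence(words)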
  • Step 202c If the language information includes a preset keyword, obtain a start and end position of the candidate voice segment corresponding to the preset keyword in the voice information. In this step, the start and end positions of the candidate speech segments corresponding to the preset keywords in the voice information are determined according to the phonemes corresponding to the preset keywords.
  • The server detects whether the language information of the maximum posterior probability corresponding to the multi-frame speech segments includes the preset keyword; if the language information includes the preset keyword, the start and end positions of the candidate speech segment corresponding to the preset keyword in the voice information are obtained.
  • one frame of the speech segment corresponds to a hidden state
  • at least one hidden state corresponds to one phoneme
  • at least one phoneme corresponds to one word
  • Therefore, the candidate speech segment corresponding to the preset keyword is obtained through the phonemes corresponding to each word in the preset keyword. Since each voice segment is indexed with a sequence number when the voice information is framed, and each voice segment has a start and stop position attribute, the start and end positions of the candidate voice segment in the voice information can be obtained.
  • In summary, the multi-frame speech segments are input into the weighted finite state machine network to obtain the language information of the maximum posterior probability corresponding to the multi-frame speech segments; if the language information includes the preset keyword, the start and end positions of the candidate speech segment corresponding to the preset keyword are obtained in the voice information, which improves the accuracy of candidate speech segment recognition.
  • In addition, the posterior probability of the hidden state corresponding to each frame of speech segment is obtained by inputting the multi-frame speech segments into the deep neural network. Since the deep neural network has a strong feature extraction capability, the posterior probabilities it produces are more accurate, thereby improving the accuracy of candidate speech segment recognition.
  • FIG. 6 illustrates a method flowchart of a voice recognition method provided by an exemplary embodiment of the present application.
  • The method can be applied to the server 120 shown in FIG. 1A and FIG. 1B, and can also be applied to the terminal.
  • the method can be an implementation of step 204 in the embodiment of FIG. 2, the method includes:
  • step 204a the candidate speech segments are input into the convolutional neural network.
  • After obtaining the candidate speech segment by the method in the embodiment of FIG. 2 or the embodiment of FIG. 3, the server inputs the candidate speech segment into the convolutional neural network.
  • The convolutional neural network includes at least two convolutional layers, a fully connected layer, and a soft maximization function, and each convolutional layer is followed by a pooling layer.
  • A structure with two convolutional layers is taken as an example here; this does not mean that the convolutional neural network includes only two convolutional layers.
  • Step 204b: Convolve and pool the candidate speech segment through the convolutional neural network to obtain the high-level semantic features of the candidate speech segment.
  • The server convolves the candidate speech segment with the first convolutional layer of the convolutional neural network to obtain a first high-level semantic feature, and inputs the first high-level semantic feature into the first pooling layer to obtain a first compressed high-level semantic feature.
  • The first compressed high-level semantic feature is input into the second convolutional layer to obtain a second high-level semantic feature, and the second high-level semantic feature is input into the second pooling layer to obtain a second compressed high-level semantic feature.
  • After repeated convolution and pooling in this way, the high-level semantic features of the candidate speech segment are extracted.
  • Step 204c classify high-level semantic features of the candidate speech segments by using a fully connected layer and a soft maximization function in the convolutional neural network, and detect whether the candidate speech segments include preset keywords.
  • In this step, the candidate speech segment has been processed by the multiple convolutional and pooling layers to obtain high-level semantic features; the high-level semantic features extracted by the convolutional and pooling layers are connected by the fully connected layer and then passed to the soft maximization function, which classifies the high-level semantic features and outputs a result indicating whether the candidate speech segment contains the preset keyword.
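  • A minimal sketch of such a verification network is given below in PyTorch; the number of channels, kernel sizes, and the assumed 100-frame by 40-bin input feature map are illustrative choices rather than parameters taken from the original disclosure.

        import torch
        import torch.nn as nn

        class KeywordVerifier(nn.Module):
            """Two cascaded convolution+pooling blocks, a fully connected layer,
            and Softmax, outputting keyword / non-keyword probabilities."""
            def __init__(self, n_classes: int = 2):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2),                       # first pooling layer
                    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2),                       # second pooling layer
                )
                self.fc = nn.Linear(32 * 25 * 10, n_classes)   # 100x40 input -> 25x10 map

            def forward(self, x):                          # x: (batch, 1, 100, 40)
                h = self.features(x)
                h = h.flatten(start_dim=1)                 # connect the extracted features
                return torch.softmax(self.fc(h), dim=-1)   # classify with Softmax

        # Usage: probs = KeywordVerifier()(torch.randn(1, 1, 100, 40))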
  • In other words, the multi-frame speech segments are input into the acoustic model to obtain the phonemes of the maximum posterior probability corresponding to the multi-frame speech segments; the character or word of the maximum posterior probability corresponding to the multi-frame speech segments is obtained through the dictionary; and the word or sentence of the maximum posterior probability is obtained through the language model. It is then detected whether the word or sentence contains the preset keyword; if so, the candidate speech segment corresponding to the preset keyword is intercepted and input into the convolutional neural network for verification, and the final verification result is output.
  • In summary, the candidate speech segment is input into the convolutional neural network, its high-level semantic features are obtained through convolution and pooling, the extracted high-level semantic features are connected by the fully connected layer and sent to the soft maximization function for classification, and a result indicating whether the candidate speech segment contains the preset keyword is obtained. Since the candidate speech segment has already been coarsely located by the weighted finite state machine network, the accuracy of speech recognition is improved on the basis of a guaranteed recognition rate.
  • FIG. 8 is a flowchart of a method for a voice recognition method provided by an exemplary embodiment of the present application.
  • the method can be applied to an implementation environment as shown in FIG. 1A, the method comprising:
  • Step 801 The terminal sends the acquired original voice signal to the server.
  • The silence detection module of the terminal determines whether there is a user voice; if it is determined that there is a user voice, the recording module is activated to record the user voice and obtain the corresponding original voice signal, and the original voice signal is sent to the server through a wired or wireless network.
  • step 802 the server performs preliminary extraction on the original voice signal to obtain voice information.
  • the server performs preliminary extraction on the received original voice signal to obtain voice information, which is a function of time domain or frequency domain.
  • step 803 the server divides the voice information into frames to obtain a multi-frame voice segment.
  • The server frames the voice information through a moving window to obtain multi-frame voice segments.
  • the moving window has a preset window length and a step length, and each frame of the voice segment has a corresponding starting and ending position and a serial number index.
  • Step 804 The server inputs the multi-frame speech segment into the deep neural network to obtain a posterior probability between each frame segment of the multi-frame speech segment and the corresponding hidden state.
  • The deep neural network outputs the posterior probability between each frame of speech segment and the corresponding hidden state; therefore, the hidden state corresponding to each frame of speech segment cannot be obtained directly from the deep neural network, and forward decoding needs to be performed on each frame of speech segment through the hidden Markov model.
  • Step 805: The server converts the posterior probability of the hidden state corresponding to each frame of speech segment by using a Bayesian formula to obtain the emission probability of the hidden state corresponding to each frame of speech segment.
  • Forward decoding through the hidden Markov model requires the emission probability of the hidden state corresponding to each frame of speech segment.
  • Therefore, the server converts the posterior probability of the hidden state corresponding to each frame of speech segment through the Bayesian formula to obtain the emission probability of the hidden state corresponding to each frame of speech segment.
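  • The Bayesian conversion referred to in this step is commonly written as follows (a standard hybrid DNN-HMM relation, restated here for reference with notation chosen for this example):

        p(x_t | s) = p(s | x_t) * p(x_t) / p(s)  ∝  p(s | x_t) / p(s)

    where p(s | x_t) is the posterior probability output by the deep neural network for the speech frame x_t, p(s) is the prior probability of the hidden state s (for example, estimated from state frequencies in the training data), and p(x_t) does not depend on the state and can therefore be ignored during decoding.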
  • Step 806: The server performs forward decoding through the hidden Markov model according to the emission probability of the hidden state corresponding to each frame of speech segment, the initial probability of each hidden state in the hidden Markov model, and the transition probabilities between the hidden states, to obtain the hidden states of the maximum posterior probability corresponding to the multi-frame speech segments.
  • The initial probability of each hidden state in the hidden Markov model and the transition probabilities between the hidden states are parameters that have already been trained. According to the emission probabilities of the hidden states corresponding to each frame of speech segment obtained in step 805, combined with the initial probabilities of the hidden states and the transition probabilities between them, forward decoding is performed on each frame of speech segment through the hidden Markov model to obtain the hidden states of the maximum posterior probability corresponding to the multi-frame speech segments.
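  • As an illustration of this decoding step, the following Python sketch performs Viterbi-style forward decoding over the hidden Markov model in the log domain; interpreting the forward decoding as a Viterbi search for the best state path, as well as all variable names, are assumptions made for this example.

        import numpy as np

        def viterbi_decode(log_emission, log_init, log_trans):
            """Return the most likely hidden-state index for each of T frames.
            log_emission: (T, K) log emission probabilities per frame and state
            log_init:     (K,)   log initial probabilities of the K hidden states
            log_trans:    (K, K) log transition probabilities a_SiSj"""
            T, K = log_emission.shape
            score = log_init + log_emission[0]        # best log-score ending in each state
            back = np.zeros((T, K), dtype=int)        # backpointers
            for t in range(1, T):
                cand = score[:, None] + log_trans     # (K, K): previous state -> next state
                back[t] = np.argmax(cand, axis=0)
                score = cand[back[t], np.arange(K)] + log_emission[t]
            path = [int(np.argmax(score))]            # best final state
            for t in range(T - 1, 0, -1):             # trace the path backwards
                path.append(int(back[t, path[-1]]))
            return path[::-1]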
  • Step 807 The server obtains a phoneme corresponding to the multi-frame speech segment according to a hidden state corresponding to each frame of the voice segment.
  • the phoneme is composed of at least one hidden state
  • the server obtains the phoneme corresponding to the multi-frame speech segment according to the hidden state corresponding to each frame of the speech segment.
  • Step 808 The server obtains the language information of the maximum posterior probability corresponding to the multi-frame speech segment according to the phoneme corresponding to the multi-frame speech segment, and the dictionary and the language model.
  • a word consists of at least one phoneme, and the dictionary contains the correspondence between words and phonemes.
  • The server obtains the character or word of the maximum posterior probability corresponding to the multi-frame speech segments through the dictionary, and then, according to that character or word, obtains the language information of the maximum posterior probability corresponding to the multi-frame speech segments through the language model. The language information may be a word or a sentence, and the language model is a correspondence between words and grammar and/or syntax.
  • Both the correspondence between words and phonemes in the dictionary and the correspondence between words and grammar and/or syntax in the language model are probabilistic correspondences.
  • That is, the language information of the maximum posterior probability corresponding to the multi-frame speech segments, which the server obtains through the dictionary and the language model according to the phonemes corresponding to the multi-frame speech segments, is found by searching for the path with the maximum probability.
  • Step 809 If the language information includes a preset keyword, the server acquires a starting and ending position of the candidate voice segment corresponding to the preset keyword in the voice information. In this step, the server determines, according to the phoneme corresponding to the preset keyword, the starting and ending position of the candidate voice segment corresponding to the preset keyword in the voice information.
  • The server detects whether the language information of the maximum posterior probability corresponding to the multi-frame speech segments includes the preset keyword; if it is determined that the language information includes the preset keyword, the server determines the start and end positions of the candidate voice segment corresponding to the preset keyword in the voice information. If the language information of the maximum posterior probability corresponding to the multi-frame speech segments does not include the preset keyword, the procedure stops at this step.
  • Step 810 The server intercepts the candidate voice segment in the voice information according to the start and end position of the candidate voice segment in the voice information.
  • the server intercepts the candidate speech segments from the speech information according to the starting and ending positions of the candidate speech segments in the speech information.
  • Step 811 The server inputs the candidate speech segments into the convolutional neural network, and performs convolution and pooling extraction on the candidate speech segments by the convolutional neural network to obtain high-level semantic features of the candidate speech segments.
  • The server inputs the intercepted candidate speech segment into the convolutional neural network and convolves it with the first convolutional layer to obtain a first high-level semantic feature;
  • the first high-level semantic feature is input into the first pooling layer to obtain a first compressed high-level semantic feature;
  • the first compressed high-level semantic feature is input into the second convolutional layer to obtain a second high-level semantic feature;
  • the second high-level semantic feature is input into the second pooling layer to obtain a second compressed high-level semantic feature. After repeated convolution and pooling in this way, the high-level semantic features of the candidate speech segment are extracted.
  • Step 812: The server classifies the high-level semantic features of the candidate speech segment by using the fully connected layer and the soft maximization function in the convolutional neural network, and detects whether the candidate speech segment includes the preset keyword.
  • The candidate speech segment has been processed by the multiple convolutional and pooling layers to obtain high-level semantic features; the high-level semantic features extracted by the convolutional and pooling layers are connected by the fully connected layer and then passed to the soft maximization function, which classifies the high-level semantic features and outputs a result indicating whether the candidate speech segment contains the preset keyword.
  • Step 813 If the candidate speech segment includes a preset keyword, the server sends a wake-up instruction to the terminal.
  • the server sends a wake-up command to the terminal through a wired or wireless network.
  • step 814 the terminal releases the sleep state and/or the lock screen state of the local machine according to the wake-up instruction.
  • After receiving the wake-up command sent by the server, the terminal cancels the sleep state and/or the lock screen state of the local machine according to the wake-up command.
  • In summary, the candidate voice segment coarsely located by the weighted finite state machine network is verified by the convolutional neural network to determine whether it includes the preset keyword. This solves the problem in the related art that voice information without semantics may be recognized as semantic voice information and cause false wake-ups, and improves the accuracy of voice recognition.
  • In addition, the multi-frame speech segments are input into the weighted finite state machine network to obtain the language information of the maximum posterior probability corresponding to the multi-frame speech segments; if the language information includes the preset keyword, the start and end positions of the candidate speech segment corresponding to the preset keyword are obtained in the voice information, which improves the accuracy of candidate speech segment recognition.
  • Furthermore, the posterior probability of the hidden state corresponding to each frame of speech segment is obtained by inputting the multi-frame speech segments into the deep neural network; since the deep neural network has a strong feature extraction capability, the posterior probabilities it produces are more accurate, thereby further improving the accuracy of candidate speech segment recognition.
  • Finally, the candidate speech segment is input into the convolutional neural network, its high-level semantic features are extracted through convolution and pooling, and the extracted high-level semantic features are connected through the fully connected layer and sent to the soft maximization function to obtain a result indicating whether the candidate speech segment contains the preset keyword. Since the candidate speech segment has already been coarsely located by the weighted finite state machine network, the accuracy of speech recognition is improved on the basis of a guaranteed recognition rate.
  • FIGS. 9 and 10 illustrate an application scenario of a speech recognition method provided by an exemplary embodiment of the present application.
  • As shown in FIG. 9, the smart robot 910, the smart speaker 920, the smart mobile phone 930, and the like transmit the acquired voice information to the cloud through a wired or wireless network; the cloud detects, by using the method in the foregoing embodiments, whether each piece of voice information includes the corresponding preset keyword, and if the preset keyword is included, sends a wake-up command to the corresponding terminal, which is then released from the sleep and/or lock screen state.
  • FIG. 10 provides an offline voice recognition application scenario.
  • The user 1010 speaks a wake-up word (i.e., a preset keyword) to the electronic device 1020.
  • After detecting that the user has spoken the wake-up word, the electronic device 1020 records to obtain the original voice signal, performs preliminary feature extraction on the original voice signal to obtain voice information, detects by the method in the foregoing embodiments whether the voice information includes the preset keyword, and, if the preset keyword is included, cancels the sleep and/or lock screen state of the local device.
  • the computing resources of an electronic device are relatively limited, and the electronic device needs to be customized for different hardware.
  • The customization process is as follows: the electronic device manufacturer submits the hardware resources that the electronic device can allocate to the voice wake-up module; after receiving the data submitted by the manufacturer, the server designs, according to the hardware resources that the electronic device can allocate, a model that can run on the electronic device; the model is trained with training data matching the application environment of the electronic device; the obtained model is jointly tested and specifically tuned, and then delivered to the electronic device manufacturer for integration. After integration, the user can wake up the electronic device in an offline environment, and the wake-up method is the same as that of the online service.
  • FIG. 11 is a structural block diagram of a voice recognition apparatus provided by an exemplary embodiment of the present application. The apparatus is applicable to the terminal 110, the terminal 130, or the server 120 shown in FIG. 1A and FIG. 1B.
  • the device includes an acquisition module 1110 and a processing module 1120:
  • the obtaining module 1110 is configured to obtain voice information.
  • the processing module 1120 is configured to determine, by using a weighted finite state machine network, a start and stop position of the candidate voice segment in the voice information; intercept the candidate voice segment in the voice information according to the start and stop position; input the candidate voice segment into the machine learning model, and pass the machine learning model Detecting whether the candidate speech segment includes a preset keyword; if the candidate speech segment includes the preset keyword, determining that the voice information includes the preset keyword.
  • The processing module 1120 is further configured to frame the voice information to obtain multi-frame voice segments, and input the multi-frame voice segments into the weighted finite state machine network to obtain the language information of the maximum posterior probability corresponding to the multi-frame voice segments.
  • The obtaining module 1110 is further configured to: if the language information includes the preset keyword, determine the start and stop positions of the candidate voice segment corresponding to the preset keyword in the voice information; the candidate voice segment includes at least one frame of voice segment in the multi-frame voice segments.
  • the weighted finite state machine network includes a deep neural network, a hidden Markov model, a dictionary, and a language model;
  • The processing module 1120 is further configured to: input the multi-frame speech segments into the deep neural network to obtain the posterior probability of the hidden state corresponding to each frame of the multi-frame speech segments; obtain, according to the posterior probability of the hidden state corresponding to each frame of speech segment, the hidden state corresponding to each frame of speech segment through the hidden Markov model; obtain the phonemes corresponding to the multi-frame speech segments according to the hidden state corresponding to each frame of speech segment; and obtain, according to the phonemes corresponding to the multi-frame speech segments and in combination with the dictionary and the language model, the language information of the maximum posterior probability corresponding to the multi-frame speech segments; where the dictionary includes the correspondence between phonemes and words, and the language model includes the correspondence between words and grammar and/or syntax.
  • The processing module 1120 is further configured to: convert, by using the Bayesian formula, the posterior probability of the hidden state corresponding to each frame of voice segment to obtain the emission probability of the hidden state corresponding to each frame of voice segment; and perform forward decoding through the hidden Markov model according to the emission probability of the hidden state corresponding to each frame of voice segment, the initial probability of each hidden state, and the transition probabilities between the hidden states, to obtain the hidden states of the maximum posterior probability corresponding to the multi-frame voice segments.
  • The processing module 1120 is further configured to: input the candidate speech segment into the convolutional neural network; perform convolution and pooling on the candidate speech segment through the convolutional neural network to extract the high-level semantic features of the candidate speech segment; and classify the high-level semantic features of the candidate speech segment through the fully connected layer and the soft maximization function, to detect whether the candidate speech segment contains the preset keyword.
  • Referring to FIG. 12, it is a structural block diagram of a voice recognition device provided by an exemplary embodiment of the present application.
  • the device includes a processor 1210 and a memory 1220.
  • The processor 1210 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
  • the processor 1210 may further include a hardware chip.
  • the hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the memory 1220 is coupled to the processor 1210 by a bus or other means.
  • The memory 1220 stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor 1210 to implement the speech recognition method of FIG. 2, FIG. 3, FIG. 6, or FIG. 8.
  • the memory 1220 may be a volatile memory, a non-volatile memory, or a combination thereof.
  • The volatile memory may be a random access memory (RAM), such as a static random access memory (SRAM) or a dynamic random access memory (DRAM).
  • The non-volatile memory may be a read-only memory (ROM), such as a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM).
  • The non-volatile memory may also be a flash memory or a magnetic memory, such as a magnetic tape, a floppy disk, or a hard disk.
  • the non-volatile memory can also be an optical disc.
  • The present application further provides a computer-readable storage medium, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the voice recognition method provided by the above method embodiments.
  • The present application also provides a computer program product including instructions which, when run on a computer, cause the computer to perform the voice recognition methods described in the above aspects.
  • "A plurality" as referred to herein means two or more.
  • "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A exists alone, A and B exist at the same time, or B exists alone.
  • The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
  • A person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biophysics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Algebra (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A speech recognition method, apparatus and device, belonging to the field of speech recognition. The method comprises: acquiring speech information (201); determining the start and end positions of a candidate speech segment in the speech information by means of a weighted finite state transducer network (202); extracting the candidate speech segment from the speech information according to the start and end positions (203); and inputting the candidate speech segment into a machine learning model, and detecting whether the candidate speech segment contains a preset keyword by means of the machine learning model (204). The candidate speech segment coarsely located by the weighted finite state transducer network is verified by the machine learning model to determine whether the candidate speech segment contains the preset keyword, which solves the problem in the related art that speech information without semantics may be recognized as semantically meaningful speech information and thus cause false wake-up, and improves the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device and storage medium
This application claims priority to Chinese Patent Application No. 201810240076.X, entitled "Speech recognition method, apparatus, device and storage medium", filed with the Chinese Patent Office on March 22, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech recognition, and in particular to a speech recognition method, apparatus, device and storage medium.
Background
Voice wake-up, also known as keyword spotting (KWS), is a function by which an electronic device in a sleep or lock-screen state recognizes a user's speech and, upon determining that the speech contains a preset keyword, releases the sleep and/or lock-screen state and starts voice interaction. In the voice wake-up process, speech recognition is a key step.
Summary
The embodiments of this application provide a speech recognition method, apparatus and device. The technical solutions are as follows:
An embodiment of this application provides a speech recognition method, performed by a terminal or a server, the method comprising:
acquiring speech information;
determining, by means of a weighted finite state transducer network, a candidate speech segment in the speech information and the start and end positions of the candidate speech segment;
extracting the candidate speech segment from the speech information according to the start and end positions;
inputting the candidate speech segment into a machine learning model, and detecting, by means of the machine learning model, whether the candidate speech segment contains a preset keyword;
if the candidate speech segment contains the preset keyword, determining that the speech information contains the preset keyword.
An embodiment of this application further provides a voice wake-up method, comprising:
sending, by a terminal, acquired speech information to a server;
detecting, by the server, whether the speech information contains a preset keyword;
if the speech information contains the preset keyword, extracting, by the server, a candidate speech segment from the speech information, the candidate speech segment being the segment of the speech information corresponding to the preset keyword;
verifying, by the server, the candidate speech segment, and detecting again whether the candidate speech segment contains the preset keyword;
if the candidate speech segment contains the preset keyword, sending a wake-up instruction to the terminal;
releasing, by the terminal, its sleep state and/or lock-screen state according to the wake-up instruction.
An embodiment of this application further provides a speech recognition apparatus, the apparatus comprising:
an acquisition module, configured to acquire speech information;
a processing module, configured to determine, by means of a weighted finite state transducer network, the start and end positions of a candidate speech segment in the speech information; extract the candidate speech segment from the speech information according to the start and end positions; input the candidate speech segment into a machine learning model, and detect, by means of the machine learning model, whether the candidate speech segment contains the preset keyword; and if the candidate speech segment contains the preset keyword, determine that the speech information contains the preset keyword.
An embodiment of this application further provides a speech recognition device, comprising a processor and a memory, the memory storing at least one instruction, the at least one instruction being loaded and executed by the processor to implement the speech recognition method described above.
An embodiment of this application further provides a computer-readable storage medium storing at least one instruction, the at least one instruction being loaded and executed by a processor to implement the speech recognition method described above.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the accompanying drawings described below are only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1A is a diagram of an implementation environment of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 1B is a diagram of an implementation environment of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 2 is a flowchart of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 3 is a flowchart of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 4 is a schematic diagram of dividing speech information into frames provided by an exemplary embodiment of this application;
FIG. 5 is an architecture diagram of a weighted finite state transducer network provided by an exemplary embodiment of this application;
FIG. 6 is a flowchart of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 7A is an architecture diagram of a convolutional neural network provided by an exemplary embodiment of this application;
FIG. 7B is an overall architecture diagram of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 8 is a flowchart of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 9 is a diagram of an application scenario of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 10 is a diagram of an application scenario of a speech recognition method provided by an exemplary embodiment of this application;
FIG. 11 is a structural block diagram of a speech recognition apparatus provided by an exemplary embodiment of this application;
FIG. 12 is a structural block diagram of a speech recognition device provided by an exemplary embodiment of this application.
Detailed Description
To make the objectives, technical solutions and advantages of this application clearer, the implementations of this application are further described in detail below with reference to the accompanying drawings.
For ease of understanding, the terms involved in the embodiments of this application are explained below.
Machine learning model: a computational model consisting of a large number of interconnected nodes (or neurons). Each node corresponds to a policy function, and each connection between two nodes represents a weighting value, called a weight, for the signal passing through that connection. After a sample is input into a node of the machine learning model, each node produces an output that serves as the input sample of the next node; the machine learning model adjusts the policy function and weight of each node according to the final output for the sample, a process referred to as training.
Weighted finite state transducer network: a mathematical model representing a finite number of states and the behaviors, such as transitions and actions, between these states. In the embodiments of this application, the weighted finite state transducer network includes an acoustic model, a dictionary and a language model.
Acoustic model: a mathematical model that outputs, from speech information, the hidden states with the maximum posterior probability. A hidden state may be a phoneme, or a speech unit smaller than a phoneme. The acoustic model in the embodiments of this application is a hidden Markov-deep neural network model.
Phoneme: the smallest speech unit divided according to the natural attributes of speech. In terms of acoustic properties, a phoneme is the smallest speech unit divided from the perspective of sound quality; in terms of physiological properties, one articulatory action forms one phoneme.
Hidden Markov Model (HMM): a statistical analysis model used to describe a Markov process with hidden, unknown parameters. In a hidden Markov model, the states are not directly visible, but some variables affected by the states are visible.
Multilayer Perceptron (MLP): a feed-forward neural network that nonlinearly maps a set of input vectors to a set of output vectors. A multilayer perceptron can be trained with the back-propagation algorithm.
Deep Neural Network (DNN): a machine learning model, namely a multilayer perceptron with more than two hidden layers. Except for the input nodes, each node is a neuron with a nonlinear activation function. Like the multilayer perceptron, a deep neural network can be trained with the back-propagation algorithm.
Convolutional Neural Network (CNN): a machine learning model comprising at least two cascaded convolutional layers, a fully connected layer (FC) on top, and a softmax function, with a pooling layer following each convolutional layer. The softmax function, also called the normalized exponential function, "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) such that every element lies in the range (0, 1) and all elements sum to 1.
By sharing parameters, a CNN reduces the number of model parameters, which makes it widely used in image and speech recognition.
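The softmax mapping described above can be written compactly as follows (a standard formulation consistent with the description; σ(z) denotes the output vector):

```latex
\sigma(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}, \quad i = 1, \dots, K,
\qquad \sigma(z)_i \in (0,1), \qquad \sum_{i=1}^{K} \sigma(z)_i = 1.
```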
In some examples, a speech recognition method includes: extracting features from speech information, converting the speech information into corresponding text information through a weighted finite state transducer (WFST) network, and detecting whether the text information contains a preset keyword.
In the process of converting speech information into corresponding text information, semantic recognition needs to be performed on the speech information. Owing to the limitations of the weighted finite state transducer network, speech information that has no semantics but is similar to the preset keyword, such as noise or background music, may be recognized as semantically meaningful speech information, thereby waking up the electronic device by mistake and resulting in low recognition accuracy.
Referring to FIG. 1A and FIG. 1B, diagrams of the implementation environment of the speech recognition method provided by an exemplary embodiment of this application are shown.
FIG. 1A shows a first possible implementation environment provided by an embodiment of this application, which includes a terminal 110 and a server 120, where the terminal 110 establishes a connection with the server 120 through a wired or wireless network.
In this embodiment, the terminal 110 acquires speech information, and the server 120 recognizes the speech information and instructs the terminal 110 to release its sleep state and/or lock-screen state.
A silence detection unit of the terminal 110 determines whether there is a user voice in the silent environment; if it is determined that there is a user voice, a recording unit is activated to record the user voice and obtain a corresponding original speech signal, which is sent to the server 120 through a wired or wireless network.
The server 120 performs preliminary feature extraction on the original speech signal to obtain speech information, and detects whether the speech information contains a preset keyword; if the speech information contains the preset keyword, a candidate speech segment is extracted from the speech information, the candidate speech segment being the segment of the speech information corresponding to the preset keyword; the candidate speech segment is verified a second time to detect whether the candidate speech segment contains the preset keyword; if the candidate speech segment contains the preset keyword, a wake-up instruction is sent to the terminal 110.
After receiving the wake-up instruction sent by the server 120, the terminal 110 releases its sleep state and/or lock-screen state according to the wake-up instruction.
FIG. 1B shows a second possible implementation environment provided by an embodiment of this application, which includes a terminal 110, a terminal 130 and a server 120, where the terminal 110 establishes a connection with the server 120 through a wired or wireless network, and the terminal 130 establishes a connection with the server 120 through a wired or wireless network. In this embodiment, the terminal 110 acquires speech information, and the server 120 recognizes the speech information and instructs the terminal 130 to release its sleep state and/or lock-screen state.
A silence detection unit of the terminal 110 determines whether there is a user voice in the silent environment; if it is determined that there is a user voice, a recording unit is activated to record the user voice and obtain a corresponding original speech signal, which is sent to the server 120 through a wired or wireless network.
The server 120 performs preliminary feature extraction on the original speech signal to obtain speech information, and detects whether the speech information contains a preset keyword; if the speech information contains the preset keyword, a candidate speech segment is extracted from the speech information, the candidate speech segment being the segment of the speech information corresponding to the preset keyword; the candidate speech segment is verified a second time to detect whether the candidate speech segment contains the preset keyword; if the candidate speech segment contains the preset keyword, a wake-up instruction is sent to the terminal 130.
After receiving the wake-up instruction sent by the server 120, the terminal 130 releases its sleep state and/or lock-screen state according to the wake-up instruction.
In one embodiment, the terminal 110 acquires the speech information, recognizes the speech information, and releases its own sleep state and/or lock-screen state.
The silence detection unit of the terminal 110 determines whether there is a user voice in the silent environment; if it is determined that there is a user voice, the recording unit is activated to record the user voice and obtain an original speech signal; preliminary feature extraction is performed on the original speech signal to obtain speech information; whether the speech information contains the preset keyword is detected; if the speech information contains the preset keyword, a candidate speech segment is extracted from the speech information, the candidate speech segment being the segment of the speech information corresponding to the preset keyword; the candidate speech segment is verified a second time to detect whether the candidate speech segment contains the preset keyword; if the candidate speech segment contains the preset keyword, the sleep state and/or lock-screen state of the terminal is released.
The above terminal may be an electronic device containing a silence detection unit and a recording unit, such as a mobile phone, a tablet computer, an e-book reader, a laptop computer, a desktop computer, a smart speaker, a smart robot, an in-vehicle control center, and the like.
Referring to FIG. 2, a flowchart of a speech recognition method provided by an exemplary embodiment of this application is shown. The method may be used in the server 120 shown in FIG. 1A and FIG. 1B, or may be applied in a terminal. The method includes:
Step 201: acquire speech information.
The server receives an original speech signal sent by the terminal and performs preliminary feature extraction on the original speech signal to obtain the speech information.
Exemplarily, after determining that there is a user voice, the terminal records the user voice to obtain an original speech signal and sends the original speech signal to the server through a wired or wireless network, and the server receives the original speech signal.
Step 202: determine the start and end positions of a candidate speech segment in the speech information.
Exemplarily, the server obtains, through the weighted finite state transducer network, the language information with the maximum posterior probability corresponding to the speech information; if the language information contains the preset keyword, the start and end positions, in the speech information, of the candidate speech segment corresponding to the preset keyword are determined.
If the speech information is a function of time, the start and end positions are the starting moment and the ending moment of the candidate speech segment in the speech information; if the speech information is a function of frequency, the start and end positions are the starting frequency and the ending frequency of the candidate speech segment in the speech information.
The candidate speech segment contains at least one frame of speech segment. For example, the preset keyword is "开启" ("turn on"). The language information with the maximum posterior probability obtained by the server through the weighted finite state transducer network contains "开启", where "开" corresponds to speech segment 1 and "启" corresponds to speech segment 2. The starting moment of speech segment 1 is t1 and its ending moment is t2; the starting moment of speech segment 2 is t3 and its ending moment is t4. If t1 is before t3 and t4 is after t2, the candidate speech segment is the segment of the speech information that starts at t1 and ends at t4, that is, the start and end positions of the candidate speech segment in the speech information are determined to be t1 to t4.
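As an illustration of the boundary merging described in the preceding paragraph, the following is a minimal sketch (the helper function is hypothetical, not part of the embodiments): given the time spans of the speech segments matched to each character of the keyword, the candidate segment runs from the earliest start to the latest end.

```python
def merge_keyword_spans(spans):
    """Merge per-character speech segment spans (start, end) in seconds
    into a single candidate segment, e.g. [(t1, t2), (t3, t4)] -> (t1, t4)."""
    starts, ends = zip(*spans)
    return min(starts), max(ends)

# "开" -> segment 1 = (t1, t2), "启" -> segment 2 = (t3, t4)
print(merge_keyword_spans([(0.40, 0.62), (0.58, 0.85)]))  # (0.40, 0.85)
```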
Step 203: extract the candidate speech segment from the speech information according to the start and end positions of the candidate speech segment.
The server extracts the candidate speech segment from the speech information according to the start and end positions of the candidate speech segment in the speech information.
Step 204: input the candidate speech segment into a machine learning model, and detect, by means of the machine learning model, whether the candidate speech segment contains the preset keyword.
The machine learning model includes a convolutional neural network or a weighted finite state transducer network. After coarsely locating the candidate speech segment through the weighted finite state transducer network, the server may detect the candidate speech segment through a convolutional neural network, or through a weighted finite state transducer network. Exemplarily, the server performs convolution on the candidate speech segment through the first convolutional layer of the convolutional neural network to obtain first high-level semantic features, inputs the first high-level semantic features into the first pooling layer to obtain once-compressed high-level semantic features, inputs the once-compressed high-level semantic features into the second convolutional layer to obtain second high-level semantic features, and inputs the second high-level semantic features into the second pooling layer to obtain twice-compressed high-level semantic features... After repeated convolution and pooling, the high-level semantic features of the candidate speech segment are extracted.
Exemplarily, the server obtains, through the weighted finite state transducer network, the language information with the maximum posterior probability corresponding to the candidate speech segment, and detects whether the language information contains the preset keyword.
Step 205: if the candidate speech segment contains the preset keyword, determine that the speech information contains the preset keyword.
Exemplarily, if the convolutional neural network outputs a result indicating that the candidate speech segment contains the preset keyword, the server determines that the speech information contains the preset keyword.
Exemplarily, if the language information with the maximum posterior probability corresponding to the candidate speech segment contains the preset keyword, the server determines that the speech information contains the preset keyword.
It should be noted that detecting the candidate speech segment with the weighted finite state transducer network is time-consuming and is less accurate than verifying the candidate speech segment with the convolutional neural network.
In summary, in the embodiments of this application, the candidate speech segment coarsely located by the weighted finite state transducer network is verified by the machine learning model to determine whether the candidate speech segment contains the preset keyword, which solves the problem in the related art that speech information without semantics may be recognized as semantically meaningful speech information and thus cause false wake-up, and improves the accuracy of speech recognition.
Referring to FIG. 3, a flowchart of a speech recognition method provided by an exemplary embodiment of this application is shown. The method may be applied in the server 120 shown in FIG. 1A and FIG. 1B, or in a terminal, and may be an implementation of step 202 in the embodiment of FIG. 2. The method includes:
Step 202a: divide the speech information into frames to obtain multiple frames of speech segments.
Exemplarily, the server divides the speech information into frames through a moving window to obtain multiple frames of speech segments. The moving window has a preset window length and step length, and each frame of speech segment has its own start and end positions and a sequence index.
If the speech information is a function of time, the window length and step length are measured in preset time units. As shown in FIG. 4, the window length of the moving window 400 is 20 milliseconds and the step length is 10 milliseconds, so the moving window 400 divides the speech information into frames of 20 milliseconds each, with an overlap of 10 milliseconds between adjacent frames.
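A minimal sketch of the 20 ms window / 10 ms step framing described above, assuming a 16 kHz sample rate and a one-dimensional sample array (both are illustrative assumptions, not values fixed by the embodiments):

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, win_ms=20, step_ms=10):
    """Split a 1-D signal into overlapping frames; each frame keeps its
    sequence index and start/end positions (in seconds), as in FIG. 4."""
    win = int(sample_rate * win_ms / 1000)    # 320 samples at 16 kHz
    step = int(sample_rate * step_ms / 1000)  # 160 samples at 16 kHz
    frames = []
    for idx, start in enumerate(range(0, len(samples) - win + 1, step)):
        frames.append({
            "index": idx,
            "start_s": start / sample_rate,
            "end_s": (start + win) / sample_rate,
            "data": samples[start:start + win],
        })
    return frames

frames = frame_signal(np.random.randn(16000))  # 1 s of dummy audio -> 99 frames
```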
Step 202b: input the multiple frames of speech segments into the weighted finite state transducer network to obtain the language information with the maximum posterior probability corresponding to the multiple frames of speech segments.
Exemplarily, as shown in FIG. 5, the weighted finite state transducer network includes an acoustic model, a dictionary and a language model, where the acoustic model may be composed of a deep neural network and a hidden Markov model.
The deep neural network contains at least two cascaded deep neural network layers and a fully connected layer, and is a mathematical model that outputs, for an input speech segment, the posterior probabilities of the hidden states corresponding to that speech segment. In FIG. 5, V denotes the speech segment input to the deep neural network; W denotes the parameters of each neural network layer, for example, W1 denotes the parameters of the first neural network layer and WM denotes the parameters of the M-th neural network layer; h(i) denotes the output of the i-th neural network layer, for example, h(1) denotes the output of the first neural network layer and h(M) denotes the output of the M-th neural network layer; Si denotes the i-th hidden state, for example, the first hidden state S1 and the K-th hidden state SK; asisj denotes the transition probability between the i-th hidden state Si and the j-th hidden state Sj, for example, as1s2 denotes the transition probability between the first hidden state S1 and the second hidden state S2.
The hidden Markov model is a mathematical model that outputs the hidden states corresponding to the speech segments according to the posterior probabilities of the hidden states corresponding to the speech segments.
The dictionary is the correspondence between phonemes and words. Inputting at least one phoneme into the dictionary yields the character or word with the maximum posterior probability corresponding to the at least one phoneme.
The language model is the correspondence between words and syntax and/or grammar. Inputting characters or words into the language model yields the language information with the maximum posterior probability corresponding to the words, where the language information may be a word or a sentence.
The server inputs the multiple frames of speech segments into the deep neural network for feature extraction to obtain the posterior probability of the hidden state corresponding to each frame of speech segment; obtains, through the hidden Markov model and according to the posterior probability of the hidden state corresponding to each frame of speech segment, the hidden state corresponding to each frame of speech segment; obtains the phonemes corresponding to the multiple frames of speech segments according to the hidden state corresponding to each frame of speech segment; obtains, through the dictionary, the characters or words with the maximum posterior probability corresponding to the multiple frames of speech segments; and obtains, through the language model and according to those characters or words, the language information with the maximum posterior probability corresponding to the multiple frames of speech segments.
Since each of the above conversion steps follows the path of maximum posterior probability, inputting the multiple frames of speech segments into the weighted finite state transducer network yields the language information with the maximum posterior probability corresponding to the multiple frames of speech segments.
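The maximum-probability path selection described above is typically implemented with Viterbi-style dynamic programming over the HMM states. The sketch below is a generic illustration of that idea (log-domain, toy dimensions), not the exact decoder of the embodiments:

```python
import numpy as np

def viterbi(log_emit, log_init, log_trans):
    """log_emit: (T, S) frame-wise log emission scores;
    log_init: (S,) initial log probabilities;
    log_trans: (S, S) log transition probabilities.
    Returns the maximum-probability hidden-state sequence."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # (S, S): previous state -> next state
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]                        # hidden state index per frame
```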
Step 202c: if the language information contains the preset keyword, obtain the start and end positions, in the speech information, of the candidate speech segment corresponding to the preset keyword. In this step, the start and end positions of the candidate speech segment corresponding to the preset keyword in the speech information are determined according to the phonemes corresponding to the preset keyword.
The server detects whether the language information with the maximum posterior probability corresponding to the multiple frames of speech segments contains the preset keyword; if it is determined that the language information contains the preset keyword, the start and end positions of the candidate speech segment corresponding to the preset keyword in the speech information are obtained.
Exemplarily, one frame of speech segment corresponds to one hidden state, at least one hidden state corresponds to one phoneme, and at least one phoneme corresponds to one word; the candidate speech segment corresponding to the keyword is obtained through the phonemes corresponding to each word in the preset keyword. Since each speech segment is labeled with a sequence index when the speech information is divided into frames, and each speech segment has its own start and end position attributes, the start and end positions of the candidate speech segment in the speech information can be obtained.
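Because each frame carries a sequence index and the framing parameters are known, the time span of the candidate segment can be recovered from the first and last frame indices matched to the keyword. A small sketch, reusing the illustrative 20 ms / 10 ms parameters from FIG. 4 (the function is hypothetical):

```python
def frame_range_to_time(first_idx, last_idx, win_ms=20, step_ms=10):
    """Map a run of frame indices back to start/end positions in milliseconds."""
    start_ms = first_idx * step_ms
    end_ms = last_idx * step_ms + win_ms
    return start_ms, end_ms

print(frame_range_to_time(40, 85))  # frames 40..85 -> (400, 870) ms
```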
In summary, in the embodiments of this application, the multiple frames of speech segments are input into the weighted finite state transducer network to obtain the language information with the maximum posterior probability corresponding to the multiple frames of speech segments, and if the language information contains the preset keyword, the start and end positions of the candidate speech segment corresponding to the preset keyword in the speech information are obtained, which can improve the accuracy of recognizing the candidate speech segment.
Further, in the embodiments of this application, the posterior probability of the hidden state corresponding to each frame of speech segment is obtained by inputting the multiple frames of speech segments into the deep neural network. Since the deep neural network has a strong feature extraction capability, the posterior probabilities of the hidden states obtained through the deep neural network for each frame of speech segment are more accurate, thereby improving the accuracy of recognizing the candidate speech segment.
Referring to FIG. 6, a flowchart of a speech recognition method provided by an exemplary embodiment of this application is shown. The method may be applied in the server 120 shown in FIG. 1A and FIG. 1B, or in a terminal, and may be an implementation of step 204 in the embodiment of FIG. 2. The method includes:
Step 204a: input the candidate speech segment into the convolutional neural network.
After obtaining the candidate speech segment through the method in the embodiment of FIG. 2 or FIG. 3, the server inputs the candidate speech segment into the convolutional neural network.
Exemplarily, as shown in FIG. 7A, the convolutional neural network contains at least two convolutional layers, one fully connected layer and a softmax function, with a pooling layer after each convolutional layer. Two convolutional layers are used as an example in the figure, which does not mean that the convolutional neural network contains only two convolutional layers.
Step 204b: perform convolution and pooling on the candidate speech segment through the convolutional neural network to extract the high-level semantic features of the candidate speech segment.
Exemplarily, the server performs convolution on the candidate speech segment through the first convolutional layer of the convolutional neural network to obtain first high-level semantic features, inputs the first high-level semantic features into the first pooling layer to obtain once-compressed high-level semantic features, inputs the once-compressed high-level semantic features into the second convolutional layer to obtain second high-level semantic features, and inputs the second high-level semantic features into the second pooling layer to obtain twice-compressed high-level semantic features... After repeated convolution and pooling, the high-level semantic features of the candidate speech segment are extracted.
Step 204c: classify the high-level semantic features of the candidate speech segment through the fully connected layer and the softmax function of the convolutional neural network to detect whether the candidate speech segment contains the preset keyword.
Exemplarily, after the candidate speech segment is processed by multiple convolutional and pooling layers to obtain high-level semantic features, the fully connected layer concatenates the high-level semantic features extracted by each convolutional and pooling layer and feeds them to the softmax function, which classifies the high-level semantic features and outputs a result indicating whether the candidate speech segment contains the preset keyword.
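A minimal sketch of the verification network described in steps 204a to 204c (two convolution + pooling stages, a fully connected layer, and softmax). The layer sizes, the input shape, and the use of PyTorch are illustrative assumptions; the embodiments do not fix them:

```python
import torch
import torch.nn as nn

class KeywordVerifier(nn.Module):
    """Conv -> pool -> conv -> pool -> fully connected -> softmax,
    operating on a (batch, 1, time, feature) spectrogram-like input."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # first pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # second pooling layer
        )
        self.classifier = nn.Linear(32 * 25 * 10, num_classes)  # assumes 100x40 input

    def forward(self, x):
        h = self.features(x)                      # high-level semantic features
        h = h.flatten(start_dim=1)                # concatenated for the FC stage
        return torch.softmax(self.classifier(h), dim=1)  # keyword / non-keyword

scores = KeywordVerifier()(torch.randn(1, 1, 100, 40))  # e.g. 100 frames x 40 features
```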
FIG. 7B is an overall architecture diagram of an embodiment of this application. As shown in the figure, the multiple frames of speech segments are input into the acoustic model to obtain the phonemes with the maximum posterior probability corresponding to the multiple frames of speech segments; the characters or words with the maximum posterior probability corresponding to the multiple frames of speech segments are obtained through the dictionary; the words or sentences with the maximum posterior probability corresponding to the multiple frames of speech segments are obtained through the language model, so as to detect whether the words or sentences contain the preset keyword; if so, the candidate speech segment corresponding to the preset keyword is extracted and input into the convolutional neural network for verification, and the final verification result is output.
In summary, in the embodiments of this application, the candidate speech segment is input into the convolutional neural network, convolution and pooling are performed to extract its high-level semantic features, the extracted high-level semantic features are concatenated by the fully connected layer and fed to the softmax function for classification, and a result indicating whether the candidate speech segment contains the preset keyword is obtained. Since the candidate speech segment is preliminarily located by the weighted finite state transducer network, the accuracy of speech recognition is improved while the recognition rate is guaranteed.
请参考图8,其示出了本申请一个示例性的实施例提供的语音识别方法的方法流程图。该方法可以应用于如图1A所示的实施环境中,该方法包括:
步骤801,终端将获取到的原始语音信号发送至服务器。
示例性的,终端的静音检测模块判断是否有用户声音,若确定有用户声音则激活静音检测模块对用户声音录音并得到相应的原始语音信号,并将原始语音信号通过有线或无线网络发送至服务器。
步骤802,服务器对原始语音信号进行初步提取特征,得到语音信息。
服务器对接收到的原始语音信号进行初步提取特征,得到语音信息,该语音信息是时域或频域的函数。
步骤803,服务器将语音信息分帧,得到多帧语音片段。
示例性的,服务器通过移动窗对语音信息分帧,得到多帧语音片段。其中,移动窗具有预设的窗口长度和步进长度,每一帧语音片段具有各自对应的起止位置和序号索引。
步骤804,服务器将多帧语音片段输入深度神经网络中,得到多帧语音片段中每一帧语音片段和对应的隐藏状态之间的后验概率。
深度神经网络输出的是每一帧语音片段和对应的隐藏状态之间的后验概率,因此通过深度神经网络还无法得到每一帧语音片段所对应的隐藏状态,需要对每一帧语音片段通过隐马尔可夫模型进行前向解码。
步骤805,服务器通过贝叶斯公式对每一帧语音片段对应的隐藏状态的后验概率进行转换,得到每一帧语音片段对应的隐藏状态的发射概率。
示例性的,对每一帧语音片段通过隐马尔可夫模型进行前向解码,需要语音片段对应的隐藏状态的发射概率。服务器通过贝叶斯公式对每一帧语音片段对应的隐藏状态的后验概率进行转换,得到每一帧语音片段对应的隐藏状态的发射概率。
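The preceding step converts the DNN posterior of each hidden state into an HMM emission (observation) probability via Bayes' rule. A common way to write this relation (the state prior p(s) is typically estimated from training-alignment frequencies, and p(o_t) can be dropped because it does not depend on the state; both points are standard practice rather than details stated in this document):

```latex
p(o_t \mid s) \;=\; \frac{p(s \mid o_t)\, p(o_t)}{p(s)} \;\propto\; \frac{p(s \mid o_t)}{p(s)}
```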
步骤806,服务器根据每一帧语音片段对应的隐藏状态的发射概率、隐马尔可夫模型中每个隐藏状态的初始概率以及每个隐藏状态之间的转移概率,通过隐马尔可夫模型进行前向解码得到多帧语音片段对应的最大后验概率的隐藏状态。
隐马尔可夫模型中每个隐藏状态的初始概率以及每个隐藏状态之间的转移概率是已经训练好的参数。根据步骤804中得到的每一帧语音片段对应的隐藏状态的发射概率,结合每个隐藏状态的初始概率以及每个隐藏状态之间的转移概率,通过隐马尔可夫模型对每一帧语音片段进行前向解码得到多帧语音片段对应的最大后验概率的隐藏状态。
步骤807,服务器根据每一帧语音片段对应的隐藏状态得到多帧语音片段对应的音素。
音素由至少一个隐藏状态构成,服务器根据每一帧语音片段对应的隐藏状态得到多帧语音片段对应的音素。
步骤808,服务器根据多帧语音片段对应的音素,结合词典和语言模型得到多帧语音片段对应的最大后验概率的语言信息。
单词由至少一个音素组成,词典中包含单词和音素的对应关系。服务器通过词典得到多帧语音片段对应的最大后验概率的字或单词,根据多帧语音片段对应的最大后验概率的字或单词,通过语言模型将多帧语音片段对应的最大后验概率的语言信息。其中,语言信息可以是单词,也可以是句子,语言模型是单词与语法和/或句法的对应关系。
上述词典中单词和音素的对应关系,以及语言模型中单词与语法和/或句法的对应关系是一种概率对应关系,服务器根据多帧语音片段对应的音素,通过词典和语言模型得到多帧语音片段对应的最大后验概率的语言信息,是根据最大的概率寻找路径得到的多帧语音片段对应的最大后验概率的语言信息。
步骤809,若语言信息中包含预设关键词,则服务器获取预设关键词对应的候 选语音片段在语音信息中的起止位置。在该步骤中,服务器根据预设关键词对应的音素确定预设关键词对应的候选语音片段在语音信息中的起止位置。
服务器检测多帧语音片段对应的最大后验概率的语言信息是否包含预设关键词,若确定语言信息中包含预设关键词,则确定预设关键词对应的候选语音片段在语音信息中的起止位置。若多帧语音片段对应的最大后验概率的语言信息不包括预设关键词,则停止步骤。
步骤810,服务器根据候选语音片段在语音信息中的起止位置,在语音信息中截取候选语音片段。
服务器根据候选语音片段在语音信息中的起止位置,从语音信息中截取候选语音片段。
步骤811,服务器将候选语音片段输入所述卷积神经网络中,通过卷积神经网络对候选语音片段进行卷积和池化提取得到候选语音片段的高层语义特征。
示例性的,服务器将截取到的候选语音片段输入所述卷积神经网络中,通过卷积神经网络中的第一层卷积层对候选语音片段进行卷积处理后得到第一高层语义特征,将第一高层语义特征输入第一层池化层,得到一次压缩的高层语义特征,将一次压缩的高层语义特征输入第二层卷积层,得到第二高层语义特征,将第二高层语义特征输入至第二层池化层,得到二次压缩的高层语义特征……经过多次反复卷积和池化处理后,提取得到候选语音片段的高层语义特征。
步骤812,服务器通过卷积神经网络中的全连接阶层和软最大化函数对候选语音片段的高层语义特征进行分类,检测候选语音片段是否包含所述预设关键词。
示例性的,候选语音片段通过多层卷积层和池化层处理后得到高层语义特征,由全连接层将每一层卷积层和池化层提取到的高层语义特征连接起来,输送至软最大化函数,软最大化函数对高层语义特征进行分类,输出候选语音片段是否包含预设关键词的结果。
步骤813,若候选语音片段中包含预设关键词,服务器向终端发送唤醒指令。
若卷积神经网络输出的结果为候选语音片段中包含预设关键词,服务器通过有线或无线网络向终端发送唤醒指令。
步骤814,终端根据唤醒指令解除本机的休眠状态和/或锁屏状态。
终端在接收到服务器发送的唤醒指令后,根据该唤醒指令解除本机的休眠状态 和/或锁屏状态。
综上所述,本申请实施例中,通过卷积神经网络对加权有限状态机网络粗定位的候选语音片段进行校验,确定候选语音片段是否包含预设关键词,解决了相关技术中可能会将没有语义的语音信息识别为具有语义的语音信息从而导致误唤醒的问题,提高了语音识别的准确率。
进一步的,本申请实施例中,通过将多帧语音片段输入至加权有限状态机网络得到多帧语音片段对应的最大后验概率的语言信息,若语言信息中包含预设关键词,则获取预设关键词对应的候选语音片段在语音信息中的起止位置,能够提高对候选语音片段识别的准确率。
进一步的,本申请实施例中,通过将多帧语音片段输入深度神经网络得到每一帧语音片段对应的隐藏状态的后验概率,由于深度神经网络具有较强的提取特征能力,因此通过深度神经网络得到的每一帧语音片段对应的隐藏状态的后验概率更为准确,从而提高了对候选语音片段识别的准确率。
进一步的,本申请实施例中,通过将候选语音片段输入至卷积神经网络经过卷积和池化后提取得到候选语音片段的高层语义特征,通过全连接层将提取到的高层语义特征连接起来输送至软最大化函数进行分类,得到候选语音片段是否包含预设关键词的结果,由于候选语音片段是通过加权有限状态机网络初步定位得到的,在保证识别率的基础上,提高了语音识别的准确率。
图9和图10示出了本申请一个示例性的实施例提供的语音识别方法的应用场景。
在图9的应用场景中,智能机器人910、智能音箱920、智能移动电话930等终端将获取的语音信息通过有线或无线网络传输至云端,云端通过上述实施例中的方法检测每条语音信息中是否包含各自对应的预设关键词,若包含预设关键词,则向对应的终端发送唤醒指令,将该终端从休眠和/或锁屏状态中解除。
图10提供了一种离线的语音识别应用场景,用户1010向电子设备1020说出唤醒词(即预设关键词),电子设备1010检测到用户说出唤醒词后,录音得到原始语音信号,通过对原始语音信号进行初步提取特征,通过上述实施例中的方法检测语音信息中是否包含预设关键词,若包含预设关键词,则解除本机的休眠和/或锁屏状 态。
通常电子设备的运算资源比较有限,需要针对不同硬件的电子设备进行定制,定制流程为:电子设备厂商提交电子设备能够划分给语音唤醒模块的硬件资源;服务器收到厂商的提交的数据后,根据该电子设备能够划分的硬件资源状况设计出可在该电子设备上运行的模型;采用与电子设备应用环境切合的训练数据训练模型;对所得模型进行联合测试与针对性调优,通过后下发给电子设备厂商进行集成;完成集成后,用户可在离线环境下进行电子设备唤醒,唤醒方法与在线服务相同。
请参考图11,其示出了本申请一个示例性的实施例提供的语音识别装置的结构框图,如图所示,该装置可应用于如图1所示的终端110、终端120或服务器130中,该装置包括获取模块1110和处理模块1120:
获取模块1110,用于获取语音信息。
处理模块1120,用于通过加权有限状态机网络确定语音信息中的候选语音片段的起止位置;根据起止位置在语音信息中截取候选语音片段;将候选语音片段输入机器学习模型中,通过机器学习模型检测候选语音片段是否包含预设关键词;若候选语音片段包含预设关键词,则确定语音信息包含预设关键词。
在一个实施例中,
处理模块1110,还用于将语音信息分帧,得到多帧语音片段;将多帧语音片段输入至加权有限状态机网络中,得到多帧语音片段对应的最大后验概率的语言信息。
获取模块1120,还用于若语言信息中包含预设关键词,则确定预设关键词对应的候选语音片段在语音信息中的起止位置;候选语音片段至少包括多帧语音片段中的一帧语音片段。
在一个实施例中,加权有限状态机网络包括深度神经网络、隐马尔可夫模型、词典和语言模型;
处理模块1120,还用于将多帧语音片段输入至深度神经网络中,得到多帧语音片段中每一帧语音片段对应的隐藏状态的后验概率;根据每一帧语音片段对应的隐藏状态的后验概率,通过隐马尔可夫模型得到每一帧语音片段对应的隐藏状态;根据每一帧语音片段对应的隐藏状态得到多帧语音片段对应的音素;根据多帧语音片段对应的音素,结合词典和语言模型得到多帧语音片段对应的最大后验概率的语言 信息;其中,所述词典包括所述音素和单词的对应关系,所述语言模型包括所述单词与语法和/或句法的对应关系。
在一个实施例中,
处理模块1120,还用于通过贝叶斯公式对每一帧语音片段对应的隐藏状态的后验概率进行转换,得到每一帧语音片段对应的隐藏状态的发射概率;根据每一帧语音片段对应的隐藏状态的发射概率,隐马尔可夫模型中每个隐藏状态的初始概率以及每个隐藏状态之间的转移概率,通过隐马尔可夫模型进行前向解码得到每一帧语音片段对应的隐藏状态。
在一个实施例中,
处理模块1120,还用于将候选语音片段输入卷积神经网络中;通过卷积神经网络对候选语音片段进行卷积和池化提取得到候选语音片段的高层语义特征;通过卷积神经网络中的全连接层和软最大化函数对候选语音片段的高层语义特征进行分类,检测候选语音片段是否包含所述预设关键词。
请参见图12,其示出了本申请一个示例性的实施例提供的语音处理设备的结构框图。该设备包括:处理器1210以及存储器1220。
处理器1210可以是中央处理器(英文:central processing unit,CPU),网络处理器(英文:network processor,NP)或者CPU和NP的组合。处理器1210还可以进一步包括硬件芯片。上述硬件芯片可以是专用集成电路(英文:application-specific integrated circuit,ASIC),可编程逻辑器件(英文:programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(英文:complex programmable logic device,CPLD),现场可编程逻辑门阵列(英文:field-programmable gate array,FPGA),通用阵列逻辑(英文:generic array logic,GAL)或其任意组合。
存储器1220通过总线或其它方式与处理器1210相连,存储器1220中存储有至少一条指令、至少一段程序、代码集或指令集,上述至少一条指令、至少一段程序、代码集或指令集由处理器1210加载并执行以实现如图2、图3、图6或图8的语音处理方法。存储器1220可以为易失性存储器(英文:volatile memory),非易失性存储器(英文:non-volatile memory)或者它们的组合。易失性存储器可以为随机存取存储器(英文:random-access memory,RAM),例如静态随机存取存储器(英文: static random access memory,SRAM),动态随机存取存储器(英文:dynamic random access memory,DRAM)。非易失性存储器可以为只读存储器(英文:read only memory image,ROM),例如可编程只读存储器(英文:programmable read only memory,PROM),可擦除可编程只读存储器(英文:erasable programmable read only memory,EPROM),电可擦除可编程只读存储器(英文:electrically erasable programmable read-only memory,EEPROM)。非易失性存储器也可以为快闪存储器(英文:flash memory),磁存储器,例如磁带(英文:magnetic tape),软盘(英文:floppy disk),硬盘。非易失性存储器也可以为光盘。
本申请还提供一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现上述方法实施例提供的语音处理方法。
本申请还提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述各方面所述的语音处理方法。
应当理解的是,在本文中提及的“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的较佳实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (13)

  1. 一种语音识别方法,由终端或服务器执行,其特征在于,所述方法包括:
    获取语音信息;
    通过加权有限状态机网络确定所述语音信息中的候选语音片段的起止位置;
    根据所述起止位置在所述语音信息中截取所述候选语音片段;
    将所述候选语音片段输入机器学习模型中,通过所述机器学习模型检测所述候选语音片段是否包含预设关键词;
    若所述候选语音片段包含所述预设关键词,则确定所述语音信息包含所述预设关键词。
  2. 根据权利要求1所述的方法,其特征在于,所述通过加权有限状态机网络确定所述语音信息中的候选语音片段和所述候选语音片段的起止位置,包括:
    将所述语音信息分帧,得到多帧语音片段;
    将所述多帧语音片段输入至所述加权有限状态机网络中,得到所述多帧语音片段对应的最大后验概率的语言信息;
    若所述语言信息中包含预设关键词,则确定所述预设关键词对应的候选语音片段在所述语音信息中的起止位置;所述候选语音片段至少包括所述多帧语音片段中的一帧语音片段。
  3. 根据权利要求2所述的方法,其特征在于,所述加权有限状态机网络包括深度神经网络、隐马尔可夫模型、词典和语言模型,所述将所述多帧语音片段输入至加权有限状态机网络中,得到所述多帧语音片段对应的最大后验概率的语言信息,包括:
    将所述多帧语音片段输入至所述深度神经网络中,得到所述多帧语音片段中每一帧语音片段对应的隐藏状态的后验概率;
    根据所述每一帧语音片段对应的隐藏状态的后验概率,通过所述隐马尔可夫模型得到所述每一帧语音片段对应的隐藏状态;
    根据所述每一帧语音片段对应的隐藏状态得到所述多帧语音片段对应的音素;
    根据所述多帧语音片段对应的音素,结合所述词典和所述语言模型得到所述多 帧语音片段对应的最大后验概率的语言信息;
    其中,所述词典包括所述音素和单词的对应关系,所述语言模型包括所述单词与语法和/或句法的对应关系。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述每一帧语音片段对应的隐藏状态的后验概率,通过所述隐马尔可夫模型得到所述每一帧语音片段对应的隐藏状态,包括:
    通过贝叶斯公式对所述每一帧语音片段对应的隐藏状态的后验概率进行转换,得到所述每一帧语音片段对应的隐藏状态的发射概率;
    根据所述每一帧语音片段对应的隐藏状态的发射概率,所述隐马尔可夫模型中每个隐藏状态的初始概率以及所述每个隐藏状态之间的转移概率,通过所述隐马尔可夫模型进行前向解码得到所述每一帧语音片段对应的隐藏状态。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述机器学习模型为卷积神经网络,所述将所述候选语音片段输入机器学习模型中,通过所述机器学习模型检测所述候选语音片段是否包含预设关键词,包括:
    将所述候选语音片段输入所述卷积神经网络中;
    通过所述卷积神经网络对所述候选语音片段进行卷积和池化提取得到所述候选语音片段的高层语义特征;
    通过所述卷积神经网络中的全连接层和软最大化函数对所述候选语音片段的高层语义特征进行分类,检测所述候选语音片段是否包含所述预设关键词。
  6. 一种语音唤醒方法,其特征在于,所述方法包括:
    终端将获取到的语音信息发送至服务器;
    所述服务器检测所述语音信息中是否包含预设关键词;
    若所述语音信息中包含所述预设关键词,则所述服务器在所述语音信息中截取候选语音片段;所述候选语音片段是所述预设关键词对应的语音信息片段;
    所述服务器对所述候选语音片段进行校验,再次检测所述候选语音片段中是否包含所述预设关键词;
    若所述候选语音片段中包含所述预设关键词,则向所述终端发送唤醒指令;
    所述终端根据所述唤醒指令解除所述本机的休眠状态和/或锁屏状态。
  7. 一种语音识别装置,其特征在于,所述装置包括:
    获取模块,用于获取语音信息;
    处理模块,用于通过加权有限状态机网络确定所述语音信息中的候选语音片段的起止位置;根据所述起止位置在所述语音信息中截取所述候选语音片段;将所述候选语音片段输入机器学习模型中,通过所述机器学习模型检测所述候选语音片段是否包含所述预设关键词;若所述候选语音片段包含所述预设关键词,则确定所述语音信息包含预设关键词。
  8. 根据权利要求7所述的装置,其特征在于,
    所述处理模块,还用于将所述语音信息分帧,得到多帧语音片段;将所述多帧语音片段输入至所述加权有限状态机网络中,得到所述多帧语音片段对应的最大后验概率的语言信息;
    所述获取模块,还用于若所述语言信息中包含预设关键词,则确定所述预设关键词对应的候选语音片段在所述语音信息中的起止位置;所述候选语音片段至少包括所述多帧语音片段中的一帧语音片段。
  9. 根据权利要求8所述的装置,其特征在于,所述加权有限状态机网络包括深度神经网络、隐马尔可夫模型、词典和语言模型;
    所述处理模块,还用于将所述多帧语音片段输入至所述深度神经网络中,得到所述多帧语音片段中每一帧语音片段对应的隐藏状态的后验概率;根据所述每一帧语音片段对应的隐藏状态的后验概率,通过所述隐马尔可夫模型得到所述每一帧语音片段对应的隐藏状态;根据所述每一帧语音片段对应的隐藏状态得到所述多帧语音片段对应的音素;根据所述多帧语音片段对应的音素,结合所述词典和所述语言模型得到所述多帧语音片段对应的最大后验概率的语言信息;
    其中,所述词典包括所述音素和单词的对应关系,所述语言模型包括所述单词与语法和/或句法的对应关系。
  10. 根据权利要求9所述的装置,其特征在于,
    所述处理模块,还用于通过贝叶斯公式对所述每一帧语音片段对应的隐藏状态的后验概率进行转换,得到所述每一帧语音片段对应的隐藏状态的发射概率;根据所述每一帧语音片段对应的隐藏状态的发射概率,所述隐马尔可夫模型中每个隐藏状态的初始概率以及所述每个隐藏状态之间的转移概率,通过所述隐马尔可夫模型进行前向解码得到所述每一帧语音片段对应的隐藏状态。
  11. 根据权利要求7至10任一项所述的装置,其特征在于,所述机器学习模型为卷积神经网络;
    所述处理模块,还用于将所述候选语音片段输入所述卷积神经网络中;通过所述卷积神经网络对所述候选语音片段进行卷积和池化提取得到所述候选语音片段的高层语义特征;通过所述卷积神经网络中的全连接层和软最大化函数对所述候选语音片段的高层语义特征进行分类,检测所述候选语音片段是否包含所述预设关键词。
  12. 一种语音识别设备,其特征在于,包括处理器和存储器,所述存储器中存储有至少一条指令,所述至少一条指令由所述处理器加载并执行以实现如权利要求1至5任一所述的语音识别方法。
  13. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有至少一条指令,至少一条指令由处理器加载并执行以实现权利要求1至5任一所述的语音识别方法。
PCT/CN2019/076223 2018-03-22 2019-02-27 语音识别方法、装置、设备及存储介质 WO2019179285A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020542123A JP6980119B2 (ja) 2018-03-22 2019-02-27 音声認識方法、並びにその装置、デバイス、記憶媒体及びプログラム
EP19770634.4A EP3770905A4 (en) 2018-03-22 2019-02-27 VOICE RECOGNITION METHOD, DEVICE AND DEVICE AND STORAGE MEDIUM
US16/900,824 US11450312B2 (en) 2018-03-22 2020-06-12 Speech recognition method, apparatus, and device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810240076.X 2018-03-22
CN201810240076.XA CN108564941B (zh) 2018-03-22 2018-03-22 语音识别方法、装置、设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/900,824 Continuation US11450312B2 (en) 2018-03-22 2020-06-12 Speech recognition method, apparatus, and device, and storage medium

Publications (1)

Publication Number Publication Date
WO2019179285A1 true WO2019179285A1 (zh) 2019-09-26

Family

ID=63533050

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076223 WO2019179285A1 (zh) 2018-03-22 2019-02-27 语音识别方法、装置、设备及存储介质

Country Status (5)

Country Link
US (1) US11450312B2 (zh)
EP (1) EP3770905A4 (zh)
JP (1) JP6980119B2 (zh)
CN (1) CN108564941B (zh)
WO (1) WO2019179285A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516997A (zh) * 2021-04-26 2021-10-19 常州分音塔科技有限公司 一种语音事件识别装置和方法

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564941B (zh) 2018-03-22 2020-06-02 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及存储介质
CN108566634B (zh) * 2018-03-30 2021-06-25 深圳市冠旭电子股份有限公司 降低蓝牙音箱连续唤醒延时的方法、装置及蓝牙音箱
CN109273007B (zh) * 2018-10-11 2022-05-17 西安讯飞超脑信息科技有限公司 语音唤醒方法及装置
CN109378000B (zh) * 2018-12-19 2022-06-07 科大讯飞股份有限公司 语音唤醒方法、装置、系统、设备、服务器及存储介质
CN109741752A (zh) * 2018-12-27 2019-05-10 金现代信息产业股份有限公司 一种基于语音识别的人事考评方法与系统
US11158307B1 (en) * 2019-03-25 2021-10-26 Amazon Technologies, Inc. Alternate utterance generation
CN110211588A (zh) * 2019-06-03 2019-09-06 北京达佳互联信息技术有限公司 语音识别方法、装置及电子设备
CN110335592B (zh) * 2019-06-28 2022-06-03 腾讯科技(深圳)有限公司 语音音素识别方法和装置、存储介质及电子装置
CN110473536B (zh) * 2019-08-20 2021-10-15 北京声智科技有限公司 一种唤醒方法、装置和智能设备
CN110995938B (zh) * 2019-12-13 2022-04-26 度小满科技(北京)有限公司 数据处理方法和装置
CN111432305A (zh) * 2020-03-27 2020-07-17 歌尔科技有限公司 一种耳机告警方法、装置及无线耳机
CN111522592A (zh) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 一种基于人工智能的智能终端唤醒方法和装置
CN112259077B (zh) * 2020-10-20 2024-04-09 网易(杭州)网络有限公司 语音识别方法、装置、终端和存储介质
CN112002308B (zh) * 2020-10-30 2024-01-09 腾讯科技(深圳)有限公司 一种语音识别方法及装置
CN112530408A (zh) * 2020-11-20 2021-03-19 北京有竹居网络技术有限公司 用于识别语音的方法、装置、电子设备和介质
CN112634897B (zh) * 2020-12-31 2022-10-28 青岛海尔科技有限公司 设备唤醒方法、装置和存储介质及电子装置
CN113782005B (zh) * 2021-01-18 2024-03-01 北京沃东天骏信息技术有限公司 语音识别方法及装置、存储介质及电子设备
CN113761841B (zh) * 2021-04-19 2023-07-25 腾讯科技(深圳)有限公司 将文本数据转换为声学特征的方法
CN113129874B (zh) * 2021-04-27 2022-05-10 思必驰科技股份有限公司 语音唤醒方法及系统
CN113707135B (zh) * 2021-10-27 2021-12-31 成都启英泰伦科技有限公司 一种高精度连续语音识别的声学模型训练方法
CN114038457B (zh) * 2021-11-04 2022-09-13 贝壳找房(北京)科技有限公司 用于语音唤醒的方法、电子设备、存储介质和程序
US11770268B2 (en) * 2022-02-14 2023-09-26 Intel Corporation Enhanced notifications for online collaboration applications

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9390708B1 (en) * 2013-05-28 2016-07-12 Amazon Technologies, Inc. Low latency and memory efficient keywork spotting
CN106157950A (zh) * 2016-09-29 2016-11-23 合肥华凌股份有限公司 语音控制系统及其唤醒方法、唤醒装置和家电、协处理器
CN106448663A (zh) * 2016-10-17 2017-02-22 海信集团有限公司 语音唤醒方法及语音交互装置
CN107622770A (zh) * 2017-09-30 2018-01-23 百度在线网络技术(北京)有限公司 语音唤醒方法及装置
CN107767863A (zh) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 语音唤醒方法、系统及智能终端
CN108564941A (zh) * 2018-03-22 2018-09-21 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及存储介质

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1639768B (zh) * 2002-08-01 2010-05-26 艾利森电话股份有限公司 自动语音识别方法及装置
US8244522B2 (en) * 2007-05-22 2012-08-14 Honda Motor Co., Ltd. Language understanding device
US9477753B2 (en) * 2013-03-12 2016-10-25 International Business Machines Corporation Classifier-based system combination for spoken term detection
JP2014232258A (ja) * 2013-05-30 2014-12-11 株式会社東芝 連携業務支援装置、方法およびプログラム
JP6176055B2 (ja) * 2013-10-21 2017-08-09 富士通株式会社 音声検索装置及び音声検索方法
US9196243B2 (en) * 2014-03-31 2015-11-24 International Business Machines Corporation Method and system for efficient spoken term detection using confusion networks
CN107211062B (zh) * 2015-02-03 2020-11-03 杜比实验室特许公司 虚拟声学空间中的音频回放调度
EP3254456B1 (en) * 2015-02-03 2020-12-30 Dolby Laboratories Licensing Corporation Optimized virtual scene layout for spatial meeting playback
CN107211027B (zh) * 2015-02-03 2020-09-15 杜比实验室特许公司 感知质量比会议中原始听到的更高的后会议回放系统
US10516782B2 (en) * 2015-02-03 2019-12-24 Dolby Laboratories Licensing Corporation Conference searching and playback of search results
WO2016126770A2 (en) * 2015-02-03 2016-08-11 Dolby Laboratories Licensing Corporation Selective conference digest
EP3254453B1 (en) * 2015-02-03 2019-05-08 Dolby Laboratories Licensing Corporation Conference segmentation based on conversational dynamics
WO2016126768A2 (en) * 2015-02-03 2016-08-11 Dolby Laboratories Licensing Corporation Conference word cloud
US9704482B2 (en) * 2015-03-11 2017-07-11 International Business Machines Corporation Method and system for order-free spoken term detection
EP3311558B1 (en) * 2015-06-16 2020-08-12 Dolby Laboratories Licensing Corporation Post-teleconference playback using non-destructive audio transport
KR102371188B1 (ko) * 2015-06-30 2022-03-04 삼성전자주식회사 음성 인식 장치 및 방법과 전자 장치
CN105679316A (zh) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 一种基于深度神经网络的语音关键词识别方法及装置
CN110444199B (zh) * 2017-05-27 2022-01-07 腾讯科技(深圳)有限公司 一种语音关键词识别方法、装置、终端及服务器
CN107578776B (zh) * 2017-09-25 2021-08-06 咪咕文化科技有限公司 一种语音交互的唤醒方法、装置及计算机可读存储介质
US11295739B2 (en) * 2018-08-23 2022-04-05 Google Llc Key phrase spotting
US11308958B2 (en) * 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9390708B1 (en) * 2013-05-28 2016-07-12 Amazon Technologies, Inc. Low latency and memory efficient keywork spotting
CN107767863A (zh) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 语音唤醒方法、系统及智能终端
CN106157950A (zh) * 2016-09-29 2016-11-23 合肥华凌股份有限公司 语音控制系统及其唤醒方法、唤醒装置和家电、协处理器
CN106448663A (zh) * 2016-10-17 2017-02-22 海信集团有限公司 语音唤醒方法及语音交互装置
CN107622770A (zh) * 2017-09-30 2018-01-23 百度在线网络技术(北京)有限公司 语音唤醒方法及装置
CN108564941A (zh) * 2018-03-22 2018-09-21 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3770905A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516997A (zh) * 2021-04-26 2021-10-19 常州分音塔科技有限公司 一种语音事件识别装置和方法

Also Published As

Publication number Publication date
CN108564941A (zh) 2018-09-21
JP6980119B2 (ja) 2021-12-15
US11450312B2 (en) 2022-09-20
JP2021515905A (ja) 2021-06-24
EP3770905A4 (en) 2021-05-19
US20200312309A1 (en) 2020-10-01
EP3770905A1 (en) 2021-01-27
CN108564941B (zh) 2020-06-02

Similar Documents

Publication Publication Date Title
WO2019179285A1 (zh) 语音识别方法、装置、设备及存储介质
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
WO2019149108A1 (zh) 语音关键词的识别方法、装置、计算机可读存储介质及计算机设备
US10460721B2 (en) Dialogue act estimation method, dialogue act estimation apparatus, and storage medium
US11776530B2 (en) Speech model personalization via ambient context harvesting
CN110364143B (zh) 语音唤醒方法、装置及其智能电子设备
WO2018227780A1 (zh) 语音识别方法、装置、计算机设备及存储介质
Dillon et al. A single‐stage approach to learning phonological categories: Insights from Inuktitut
Du et al. Deepcruiser: Automated guided testing for stateful deep learning systems
WO2018192186A1 (zh) 语音识别方法及装置
US11195522B1 (en) False invocation rejection for speech processing systems
US20240029739A1 (en) Sensitive data control
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
US20180349794A1 (en) Query rejection for language understanding
US20230031733A1 (en) Method for training a speech recognition model and method for speech recognition
Salekin et al. Distant emotion recognition
CN115457938A (zh) 识别唤醒词的方法、装置、存储介质及电子装置
WO2024093578A1 (zh) 语音识别方法、装置、电子设备、存储介质及计算机程序产品
WO2021159756A1 (zh) 基于多模态的响应义务检测方法、系统及装置
US11437043B1 (en) Presence data determination and utilization
Fujita et al. Robust DNN-Based VAD Augmented with Phone Entropy Based Rejection of Background Speech.
Kadyan et al. Developing in-vehicular noise robust children ASR system using Tandem-NN-based acoustic modelling
CN112912954B (zh) 电子装置及其控制方法
US20240105206A1 (en) Seamless customization of machine learning models
US11869531B1 (en) Acoustic event detection model selection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19770634

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020542123

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2019770634

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2019770634

Country of ref document: EP

Effective date: 20201022