WO2024011885A1 - Voice wake-up method and device, electronic device, and storage medium - Google Patents

Voice wake-up method and device, electronic device, and storage medium

Info

Publication number
WO2024011885A1
Authority: WO (WIPO (PCT))
Prior art keywords: speech, wake, recognized, syllable, voice
Application number: PCT/CN2023/072618
Other languages: English (en), French (fr)
Inventors: 邹赛赛, 贾磊, 王海峰
Original assignee: 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science and Technology Co., Ltd.)
Priority date: the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.
Application filed by 北京百度网讯科技有限公司
Publication of WO2024011885A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L17/24: Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase

Definitions

  • The present disclosure relates to the field of artificial intelligence technology, in particular to human-computer interaction, deep learning, intelligent speech, and other technical fields. It specifically relates to a voice wake-up method, device, electronic device, storage medium, and program product.
  • Voice interaction is a natural mode of human interaction. With the continuous development of artificial intelligence technology, machines can now understand human speech, grasp its underlying meaning, and provide corresponding feedback. In these operations, the response speed of wake-up, the difficulty of waking up, accurate semantic understanding, and the promptness of feedback all affect the smoothness of voice interaction.
  • the present disclosure provides a voice wake-up method, device, electronic device, storage medium and program product.
  • According to one aspect, a voice wake-up method is provided, including: performing word recognition on speech to be recognized to obtain a wake-up word recognition result; when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes a predetermined wake-up word, performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result; and when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes predetermined syllables, determining that the speech to be recognized is a correct wake-up speech.
  • According to another aspect, a voice wake-up device is provided, including: a word recognition module for performing word recognition on speech to be recognized to obtain a wake-up word recognition result; a syllable recognition module for performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes a predetermined wake-up word; and a first determination module for determining that the speech to be recognized is a correct wake-up speech when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes predetermined syllables.
  • According to another aspect, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method of the present disclosure.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method of the present disclosure.
  • a computer program product including a computer program that, when executed by a processor, implements the method of the present disclosure.
  • Figure 1 schematically illustrates an exemplary system architecture in which voice wake-up methods and devices can be applied according to embodiments of the present disclosure
  • Figure 2 schematically shows a flow chart of a voice wake-up method according to an embodiment of the present disclosure
  • Figure 3 schematically shows a network structure diagram of a wake word recognition model according to an embodiment of the present disclosure
  • Figure 4 schematically shows a network structure diagram of a wake-up syllable recognition model according to an embodiment of the present disclosure
  • Figure 5 schematically shows a flow chart of a voice wake-up method according to another embodiment of the present disclosure
  • Figure 6 schematically shows an application diagram of a voice wake-up method according to another embodiment of the present disclosure
  • Figure 7 schematically shows a block diagram of a voice wake-up device according to an embodiment of the present disclosure.
  • FIG. 8 schematically shows a block diagram of an electronic device suitable for implementing a voice wake-up method according to an embodiment of the present disclosure.
  • the present disclosure provides a voice wake-up method, device, electronic device, storage medium and program product.
  • According to an aspect of the present disclosure, a voice wake-up method is provided, including: performing word recognition on speech to be recognized to obtain a wake-up word recognition result; when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes a predetermined wake-up word, performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result; and when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes predetermined syllables, determining that the speech to be recognized is a correct wake-up speech.
  • the user's authorization or consent is obtained before obtaining or collecting the user's personal information.
  • Figure 1 schematically shows an application scenario diagram of a voice wake-up method and device according to an embodiment of the present disclosure.
  • Figure 1 is only an example of an application scenario to which embodiments of the present disclosure can be applied, to help those skilled in the art understand the technical content of the present disclosure, but it does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.
  • As shown in Figure 1, the user 102 can send speech to be recognized to the voice interaction device 101, and the voice interaction device 101 can determine whether the speech to be recognized is a correct wake-up speech.
  • When the voice interaction device 101 determines that the speech to be recognized is a correct wake-up speech, it collects the user's command speech containing intention information and executes the intended operation in the command speech, realizing human-computer interaction between the user 102 and the voice interaction device 101.
  • Various communication client applications can be installed on the voice interaction device 101, such as knowledge reading applications, web browser applications, search applications, instant messaging tools, email clients and/or social platform software (just examples).
  • The voice interaction device 101 may include a sound collector, such as a microphone, to collect the user 102's speech to be recognized and command speech containing intention information.
  • the voice interaction device 101 may also include a sound player, such as a speaker, to play sounds emitted by the voice interaction device.
  • the voice interaction device 101 may be any electronic device capable of interacting through voice signals.
  • the voice interaction device 101 may include, but is not limited to, a smartphone, a tablet computer, a laptop computer, a smart home appliance, a smart speaker, a car speaker, a smart tutor, or a smart robot, etc.
  • The syllable recognition model and the keyword recognition model provided by the embodiments of the present disclosure are loaded in the voice interaction device 101, and the voice processing method can generally be executed by the voice interaction device 101.
  • Correspondingly, the voice processing device provided by the embodiments of the present disclosure can also be provided in the voice interaction device 101. The terminal device can implement the voice wake-up method and device provided by the embodiments of the present disclosure without interacting with a server.
  • the voice interaction device can also transmit the voice to be recognized to the server through the network, and use the server to process the voice to be recognized to determine whether the voice to be recognized is the correct wake-up voice.
  • Figure 2 schematically shows a flow chart of a voice wake-up method according to an embodiment of the present disclosure.
  • the method includes operations S210 to S230.
  • In operation S210, word recognition is performed on the speech to be recognized to obtain a wake-up word recognition result.
  • the voice to be recognized may be a wake-up voice.
  • the wake-up voice may refer to a voice signal received before the voice interaction function is awakened, such as a voice including a wake-up word or a voice including a non-wake-up word.
  • the correct wake-up voice may refer to: a voice including a wake-up word, or a voice that can wake up the voice interaction function.
  • When it is determined that the speech to be recognized is a correct wake-up speech, the voice interaction function of the voice interaction device can be triggered.
  • When it is determined that the speech to be recognized is an incorrect wake-up speech, the operation can be stopped and no response made to the user.
  • the voice interaction function may refer to a function that can receive interactive voice from a user and output a voice feedback result corresponding to the interactive voice to the user.
  • Performing word recognition on the speech to be recognized may refer to recognizing the wake-up word in the speech to be recognized: word recognition is performed on the speech to be recognized to obtain the wake-up word recognition result.
  • The wake-up word recognition result can indicate whether the speech to be recognized includes a predetermined wake-up word.
  • Performing word recognition on the speech to be recognized means recognizing the speech from a global, holistic perspective to obtain a wake-up word recognition result. For example, if the predetermined wake-up word is "Xiao D" (小D) and the speech to be recognized is "Xiao D, hello", performing word recognition on the speech to be recognized yields a wake-up word recognition result indicating that the speech includes the predetermined wake-up word.
  • According to other embodiments of the present disclosure, whether the speech to be recognized is a correct wake-up speech may be determined based on the wake-up word recognition result alone. For example, when it is determined that the wake-up word recognition result indicates that the speech includes the predetermined wake-up word, the speech may be determined to be a correct wake-up speech, and the human-computer interaction function can be turned on. When it is determined that the wake-up word recognition result indicates that the speech does not include the predetermined wake-up word, the speech may be determined to be an incorrect wake-up speech, and no response is made.
  • In operation S220, when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes the predetermined wake-up word, syllable recognition can be performed on the speech to be recognized to obtain the wake-up syllable recognition result.
  • performing syllable recognition on the speech to be recognized may refer to: performing syllable recognition on the speech to be recognized corresponding to the wake-up word, and obtaining a wake-up syllable recognition result.
  • the awakening syllable recognition result is used to characterize whether the speech to be recognized includes a predetermined syllable.
  • the predetermined syllable may refer to a syllable corresponding to the predetermined wake-up word.
  • Performing syllable recognition on the speech to be recognized means recognizing the speech locally, at the level of its character units. For example, the predetermined wake-up word "Xiao D" corresponds to two predetermined syllables, the syllable "Xiao" (小) and the syllable "D". If the speech to be recognized is "Xiao D, hello", performing syllable recognition on it yields a wake-up syllable recognition result indicating that the speech includes the predetermined syllables.
  • In operation S230, when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes the predetermined syllables, the speech to be recognized is determined to be a correct wake-up speech.
  • When it is determined that the wake-up syllable recognition result indicates that the speech to be recognized does not include the predetermined syllables, the speech to be recognized is determined to be an incorrect wake-up speech.
  • According to other embodiments of the present disclosure, word recognition may be skipped, and only syllable recognition performed on the speech to be recognized to obtain a wake-up syllable recognition result.
  • Based on the wake-up syllable recognition result, it is determined whether the speech to be recognized is a correct wake-up speech. For example, when it is determined that the wake-up syllable recognition result indicates that the speech includes the predetermined syllables, the speech is determined to be a correct wake-up speech, and the human-computer interaction function can be turned on. When it is determined that the wake-up syllable recognition result indicates that the speech does not include the predetermined syllables, the speech is determined to be an incorrect wake-up speech, and no response is made.
  • Compared with determining whether the speech to be recognized is a correct wake-up speech based only on the wake-up word recognition result, or only on the wake-up syllable recognition result, the approach provided by embodiments of the present disclosure (performing syllable recognition to obtain a wake-up syllable recognition result when it is determined that the wake-up word recognition result indicates that the speech includes the predetermined wake-up word, and then deciding based on the wake-up syllable recognition result) uses the word recognition operation to identify the wake-up word at the whole-word level while the syllable recognition operation identifies it at the character level. The speech to be recognized is thus checked both globally and locally, which ensures wake-up accuracy and avoids false wake-ups when the wake-up word has fewer than four characters, for example three or two.
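  • As a concrete illustration of operations S210 to S230, the following is a minimal Python sketch of this two-stage cascade. The model objects and their detect_wake_word / detect_syllables methods are hypothetical stand-ins, not interfaces named by the disclosure.

```python
def is_correct_wake_up(speech, word_model, syllable_model):
    """Cascaded wake-up check sketched from operations S210-S230.

    word_model and syllable_model are assumed to expose boolean
    detectors; the disclosure does not name these interfaces.
    """
    # S210: global, whole-word recognition of the wake-up word.
    if not word_model.detect_wake_word(speech):
        return False  # incorrect wake-up speech; make no response
    # S220: local, per-syllable recognition, run only after the word check passes.
    # S230: correct wake-up speech only if the syllable check also passes.
    return syllable_model.detect_syllables(speech)
```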
  • According to another embodiment of the present disclosure, for operation S210 shown in Figure 2, performing word recognition on the speech to be recognized to obtain a wake-up word recognition result may further include: performing a convolution operation on the speech to be recognized to obtain a first-level feature vector sequence; performing a gated recurrent operation on the first-level feature vector sequence to obtain a second-level feature vector sequence; and classifying the second-level feature vector sequence to obtain the wake-up word recognition result.
  • the speech to be recognized may include a sequence of speech frames.
  • the first-level feature vector sequence corresponds to the speech frame sequence one-to-one.
  • A wake word recognition model can be used to perform word recognition on the speech to be recognized to obtain the wake word recognition result, but this is not limiting: any other word recognition method that can obtain the wake-up word recognition result may also be used.
  • Figure 3 schematically shows a network structure diagram of a wake word recognition model according to an embodiment of the present disclosure.
  • the wake word recognition model includes a convolution module 310 , a gated recurrent unit 320 and a wake word classification module 330 in sequence.
  • the speech to be recognized 340 is input into the convolution module 310 to obtain a first-level feature vector sequence.
  • the first-level feature vector sequence is input to the gated loop unit 320 to obtain a second-level feature vector sequence.
  • the second-level feature vector sequence is input to the wake word classification module 330 to obtain the wake word recognition result 350.
  • the convolution module in the wake word recognition model is not limited to one, and may include multiple stacked convolution modules.
  • the wake word recognition model can also include multiple stacked gated recurrent units.
  • The convolution module may include one or more combinations of a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, and the like.
  • the wake word classification module may include a fully connected layer and an activation function.
  • The activation function can be a Softmax activation function, but is not limited to this; it can also be a Sigmoid activation function.
  • the number of fully connected layers is not limited, for example, it can be one layer or multiple layers.
  • The gated recurrent unit may be a GRU (Gate Recurrent Unit), but is not limited to this and may also be a GRU-derived module, that is, a lightweight version obtained by applying lightweight processing to the GRU.
  • Using the GRU-derived module, also known as the Projected Light-GRU module, is beneficial for loading the wake word recognition model on terminal devices such as voice interaction devices, that is, lightweight deployment on the terminal side, thereby ensuring real-time word recognition of the speech to be recognized.
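  • As a rough sketch of the Figure 3 pipeline (convolution module 310, gated recurrent unit 320, wake word classification module 330), the following minimal PyTorch model may help; all layer sizes, the use of a standard nn.GRU in place of the Projected Light-GRU, and the two-class output (wake word present or absent) are illustrative assumptions.

```python
import torch.nn as nn

class WakeWordModel(nn.Module):
    """Sketch of Figure 3: convolution -> gated recurrent unit -> classifier."""
    def __init__(self, feat_dim=80, conv_channels=64, hidden=128, num_classes=2):
        super().__init__()
        # Convolution module 310 (the text allows stacks of CNN/RNN/LSTM blocks).
        self.conv = nn.Conv1d(feat_dim, conv_channels, kernel_size=3, padding=1)
        # Gated recurrent unit 320; a standard GRU stands in for the
        # Projected Light-GRU described below.
        self.gru = nn.GRU(conv_channels, hidden, batch_first=True)
        # Wake word classification module 330: fully connected layer + Softmax.
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, frames):  # frames: (batch, time, feat_dim)
        x = self.conv(frames.transpose(1, 2)).transpose(1, 2)  # first-level features
        x, _ = self.gru(x)                                     # second-level features
        return self.fc(x[:, -1]).softmax(dim=-1)               # wake word result
```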
  • According to another embodiment of the present disclosure, performing the gated recurrent operation on the first-level feature vector sequence to obtain the second-level feature vector sequence includes repeatedly performing the following operations: based on the output vector at the previous moment and the input vector at the current moment, respectively determining the update gate at the current moment and the candidate hidden layer information at the current moment, where the input vector at the current moment is the first-level feature vector at the current moment in the first-level feature vector sequence; based on the candidate hidden layer information at the current moment, the hidden layer information at the previous moment, and the update gate at the current moment, determining the hidden layer information at the current moment; and based on the hidden layer information at the current moment and a predetermined parameter, determining the output vector at the current moment, where the output vector at the current moment is the second-level feature vector at the current moment in the second-level feature vector sequence.
  • The predetermined parameter, also called a mapping (Projection) parameter, is determined based on a lightweight parameter amount threshold.
  • The lightweight parameter amount threshold may refer to a preset benchmark for parameter size, such as a specified parameter amount threshold; the size of the predetermined parameter is kept less than or equal to the lightweight parameter amount threshold to reduce the data processing load of the wake word recognition model.
  • The Projected Light-GRU module can be expressed by the following formulas (1) to (4), where formulas (2) and (3) follow the standard Light-GRU form implied by the step description and symbol definitions:

    $z_t = \sigma(\mathrm{BN}(w_z x_t) + u_z o_{t-1})$  (1)
    $\tilde{h}_t = g(\mathrm{BN}(w_h x_t) + u_h o_{t-1})$  (2)
    $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$  (3)
    $o_t = w_o h_t$  (4)

  • Here $z_t$ represents the update gate at time t, with range (0, 1); $\sigma(\cdot)$ represents the sigmoid function; $g(\cdot)$ represents the Gaussian error linear unit activation function (e.g., the GELU activation function); $\mathrm{BN}(\cdot)$ represents the normalization function; $x_t$ represents the input vector at time t; $o_{t-1}$ represents the output vector at time t-1; $o_t$ represents the output data at time t; $w_z$ and $u_z$ represent the parameters related to the sigmoid function; $w_h$ and $u_h$ represent the parameters related to the GELU activation function; $h_{t-1}$ represents the hidden layer information at time t-1; $h_t$ represents the hidden layer information at time t; $w_o$ represents the mapping parameter; and $\tilde{h}_t$ represents the candidate hidden layer information at time t.
  • Compared with a standard GRU, the Projected Light-GRU module provided by embodiments of the present disclosure removes the reset gate and introduces the predetermined parameter, so the computational load of the wake word recognition model is small. Applying the wake word recognition model with the Projected Light-GRU module to a voice interaction device ensures high performance while reducing resource overhead, allowing the wake word recognition model loaded in the voice interaction device to run around the clock and improving the wake-up response speed of the voice interaction device.
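  • A minimal PyTorch sketch of one step of formulas (1) to (4) follows; the use of the previous projected output as the recurrent input to both branches and the interpolation convention in formula (3) follow the step description above, while the batch normalization placement and all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedLightGRUCell(nn.Module):
    """Sketch of a Projected Light-GRU step, formulas (1)-(4): no reset
    gate; the projection w_o (the predetermined parameter) maps the hidden
    state to a smaller output vector."""
    def __init__(self, input_size, hidden_size, proj_size):
        super().__init__()
        self.w_z = nn.Linear(input_size, hidden_size, bias=False)
        self.u_z = nn.Linear(proj_size, hidden_size, bias=False)
        self.w_h = nn.Linear(input_size, hidden_size, bias=False)
        self.u_h = nn.Linear(proj_size, hidden_size, bias=False)
        self.bn_z = nn.BatchNorm1d(hidden_size)
        self.bn_h = nn.BatchNorm1d(hidden_size)
        self.w_o = nn.Linear(hidden_size, proj_size, bias=False)  # projection

    def forward(self, x_t, o_prev, h_prev):
        z_t = torch.sigmoid(self.bn_z(self.w_z(x_t)) + self.u_z(o_prev))  # (1)
        h_cand = F.gelu(self.bn_h(self.w_h(x_t)) + self.u_h(o_prev))      # (2)
        h_t = z_t * h_prev + (1.0 - z_t) * h_cand                         # (3)
        o_t = self.w_o(h_t)                                               # (4)
        return o_t, h_t
```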
  • According to another embodiment of the present disclosure, for operation S220 shown in Figure 2, performing syllable recognition on the speech to be recognized to obtain the wake-up syllable recognition result may further include: extracting syllable features from the speech to be recognized to obtain a syllable feature matrix; and classifying the syllable feature matrix to obtain the wake-up syllable recognition result.
  • a syllable recognition model can be used to perform syllable recognition on the speech to be recognized, and a wake-up syllable recognition result can be obtained.
  • Other methods can also be used to perform syllable recognition on the speech to be recognized, as long as it is a syllable recognition method that can obtain a wake-up syllable recognition result.
  • FIG. 4 schematically shows a network structure diagram of a wake-up syllable recognition model according to an embodiment of the present disclosure.
  • The wake-up syllable recognition model includes, in sequence, a feature extraction coding module 410 and a syllable classification module 420.
  • the speech to be recognized 430 is input into the feature extraction coding module 410 to extract syllable features and output a syllable feature matrix.
  • the syllable feature matrix is input into the syllable classification module 420 to perform the classification operation and output the awakening syllable recognition result 440.
  • the syllable classification module may include a fully connected layer and an activation function.
  • the activation function may be a Softmax activation function, but is not limited to this, and may also be a Sigmoid activation function.
  • the number of fully connected layers is not limited, for example, it can be one layer or multiple layers.
  • The feature extraction coding module can be built using the network structure of the Conformer model (a convolution-augmented encoder), but is not limited to this: the Conformer block in the Conformer model can also be used, or a network structure obtained after lightweight processing, such as pruning, of the Conformer model or Conformer block.
  • Performing syllable feature extraction on the speech to be recognized to obtain a syllable feature matrix may further include: performing feature extraction on the speech to be recognized to obtain a feature matrix; reducing the dimensionality of the feature matrix to obtain a reduced feature matrix; and applying multi-level speech-enhancement encoding to the reduced feature matrix to obtain the syllable feature matrix.
  • the feature extraction encoding module may include a feature extraction layer, a dimensionality reduction layer, and an encoding layer in sequence.
  • the feature extraction layer can be used to extract features of the speech to be recognized and obtain a feature matrix.
  • The dimensionality reduction layer is used to reduce the dimensionality of the feature matrix to obtain the reduced feature matrix.
  • The coding layer is used to perform multi-level speech-enhancement encoding on the reduced feature matrix to obtain the syllable feature matrix.
  • the feature extraction layer may include at least one of the following: at least one relative sinusoidal positional encoding (relative sinusoidal positional encoding) layer, at least one convolutional layer, and at least one feed forward layer (Feed Forward Module).
  • The coding layer may include a Conformer block, for example including at least one of the following: multiple feed-forward layers, at least one multi-headed self-attention layer (Multi-Headed Self-Attention module), and at least one convolutional layer.
  • the dimensionality reduction layer may include a mapping function, but is not limited thereto, and may also include other implementations, such as reducing the dimensionality of a high-dimensional matrix to obtain a layer structure of a low-dimensional matrix.
  • the dimensionality reduction layer can be used to reduce the amount of data input to the encoding layer, thereby reducing the calculation amount of the syllable recognition model.
  • In addition, the number of stacked coding layers can be reduced; for example, according to the lightweight parameter amount threshold, the number of stacked coding layers can be set to any value from 1 to 4.
  • By designing the dimensionality reduction layer in the wake-up syllable recognition model and controlling the number of stacked coding layers, the wake-up syllable recognition model can be made lightweight and compact while maintaining recognition accuracy, improving recognition efficiency; when the wake-up syllable recognition model is applied to a terminal device, the internal processing load of the terminal device's processor can be reduced.
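  • A minimal sketch of the Figure 4 pipeline follows, assuming log-Mel features as input, a linear layer as the dimensionality reduction, torchaudio.models.Conformer as the stacked coding layer (1 to 4 layers per the lightweight parameter amount threshold), and a fully connected Softmax syllable classifier; all sizes and the syllable inventory are assumptions.

```python
import torch.nn as nn
import torchaudio

class WakeSyllableModel(nn.Module):
    """Sketch of Figure 4: feature extraction coding module 410
    (dimensionality reduction + small Conformer stack), then syllable
    classification module 420 (fully connected layer + Softmax)."""
    def __init__(self, feat_dim=80, reduced_dim=64, num_syllables=400, num_layers=2):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, reduced_dim)  # dimensionality reduction layer
        self.encoder = torchaudio.models.Conformer(     # coding layer, 1-4 blocks
            input_dim=reduced_dim, num_heads=4, ffn_dim=128,
            num_layers=num_layers, depthwise_conv_kernel_size=31)
        self.classifier = nn.Linear(reduced_dim, num_syllables)

    def forward(self, feats, lengths):         # feats: (batch, time, feat_dim)
        x = self.reduce(feats)
        x, lengths = self.encoder(x, lengths)  # syllable feature matrix
        return self.classifier(x).softmax(dim=-1)  # per-frame syllable posteriors
```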
  • FIG. 5 schematically shows a flow chart of a voice wake-up method according to another embodiment of the present disclosure.
  • the speech to be recognized 510 is input into the wake word recognition model 520, and a wake word recognition result 530 is obtained.
  • When it is determined that the wake-up word recognition result 530 indicates that the speech to be recognized 510 includes the predetermined wake-up word, the speech to be recognized 510 is input into the wake-up syllable recognition model 540 to obtain the wake-up syllable recognition result 550.
  • the awakening syllable recognition result 550 is used to represent that the speech to be recognized includes a predetermined syllable, it is determined that the speech to be recognized is a correct awakening speech.
  • the voice interaction device is awakened and subsequent human-computer interaction can be performed.
  • When it is determined that the wake-up word recognition result indicates that the speech to be recognized does not include the predetermined wake-up word, the speech is determined to be an incorrect wake-up speech and the operation stops. When it is determined that the wake-up syllable recognition result indicates that the speech does not include the predetermined syllables, the speech is determined to be an incorrect wake-up speech and the voice interaction device is not awakened.
  • According to other embodiments of the present disclosure, the speech to be recognized may instead first be input into the wake-up syllable recognition model to obtain the wake-up syllable recognition result. When it is determined that the wake-up syllable recognition result indicates that the speech includes the predetermined syllables, the speech is input into the wake-up word recognition model to obtain the wake-up word recognition result. When it is determined that the wake-up word recognition result indicates that the speech includes the predetermined wake-up word, the speech is determined to be a correct wake-up speech; the voice interaction device is awakened and subsequent human-computer interaction can proceed. When it is determined that the wake-up syllable recognition result indicates that the speech does not include the predetermined syllables, the speech is determined to be an incorrect wake-up speech and the operation stops. When it is determined that the wake-up word recognition result indicates that the speech does not include the predetermined wake-up word, the speech is determined to be an incorrect wake-up speech and the voice interaction device is not awakened.
  • According to still other embodiments of the present disclosure, the speech to be recognized can also be input into the wake word recognition model to obtain the wake-up word recognition result, and input into the wake-up syllable recognition model to obtain the wake-up syllable recognition result. When it is determined that the wake-up word recognition result indicates that the speech includes the predetermined wake-up word and the syllable recognition result indicates that the speech includes the predetermined syllables, the speech is determined to be a correct wake-up speech. When it is determined that the wake-up word recognition result indicates that the speech does not include the predetermined wake-up word, or the syllable recognition result indicates that the speech does not include the predetermined syllables, the speech is determined to be an incorrect wake-up speech.
  • Processing the speech to be recognized with the above wake word recognition model and wake-up syllable recognition model can be applied in scenarios where the number of characters in the wake-up word is reduced: when the wake-up word is one, two, or three characters, recognition accuracy is maintained while the false alarm rate is reduced.
  • the method of "first using the wake-up syllable model to perform syllable recognition on the speech to be recognized”, or with the method of "simultaneously using the wake-up syllable model to perform syllable recognition on the speech to be recognized and using the wake-up word recognition model to perform syllable recognition on the speech to be recognized” Compared with the method of "word recognition”, the method of "first using the wake-up word recognition model to perform word recognition on the speech to be recognized", with the features of the wake-up word recognition model's simple network structure and small computational load, can enable the terminal device to be activated in real time. In this case, the internal consumption of the voice interaction device as a terminal device can be reduced while ensuring the recognition accuracy.
  • FIG. 6 schematically shows an application diagram of a voice wake-up method according to another embodiment of the present disclosure.
  • the user 610 sends a voice to be recognized to the voice interaction device 620.
  • The voice interaction device uses the wake word recognition model and wake-up syllable recognition model loaded in the voice interaction device 620 to perform the voice wake-up method on the speech to be recognized and determine whether it is a correct wake-up speech.
  • When it is determined that the speech to be recognized is a correct wake-up speech, the target object 630 is displayed on the display interface 621 of the voice interaction device 620, and at the same time the voice interaction device 620 outputs feedback speech, reflecting human-computer interaction flexibly and vividly.
  • Figure 7 schematically shows a block diagram of a voice wake-up device according to an embodiment of the present disclosure.
  • the voice awakening device 700 includes: a word recognition module 710 , a syllable recognition module 720 and a first determination module 730 .
  • the word recognition module 710 is used to perform word recognition on the speech to be recognized and obtain the wake-up word recognition result.
  • the syllable recognition module 720 is configured to perform syllable recognition on the speech to be recognized and obtain the wake-up syllable recognition result when it is determined that the wake-up word recognition result is used to represent that the speech to be recognized includes a predetermined wake-up word.
  • the first determination module 730 is configured to determine that the speech to be recognized is the correct awakening speech when it is determined that the awakening syllable recognition result is used to represent that the speech to be recognized includes a predetermined syllable.
  • the word recognition module includes: a convolution unit, a gating unit, and a word classification unit.
  • the convolution unit is used to perform a convolution operation on the speech to be recognized to obtain the first-level feature vector sequence.
  • the speech to be recognized includes a speech frame sequence, and the first-level feature vector sequence corresponds to the speech frame sequence one-to-one.
  • The gating unit is used to perform the gated recurrent operation on the first-level feature vector sequence to obtain the second-level feature vector sequence.
  • the word classification unit is used to classify the second-level feature vector sequence to obtain wake-up word recognition results.
  • the gating unit includes the following repeated subunits:
  • The first determination subunit is used to determine, based on the output vector at the previous moment and the input vector at the current moment, the update gate at the current moment and the candidate hidden layer information at the current moment, respectively, where the input vector at the current moment is the first-level feature vector at the current moment in the first-level feature vector sequence.
  • the second determination subunit is used to determine the hidden layer information at the current moment based on the candidate hidden layer information at the current moment, the hidden layer information at the previous moment and the update gate at the current moment.
  • The third determination subunit is used to determine the output vector at the current moment based on the hidden layer information at the current moment and the predetermined parameter, where the output vector at the current moment is the second-level feature vector at the current moment in the second-level feature vector sequence.
  • the syllable recognition module includes: an extraction unit and a syllable classification unit.
  • the extraction unit is used to extract syllable features of the speech to be recognized and obtain a syllable feature matrix.
  • the syllable classification unit is used to perform classification operations on the syllable feature matrix to obtain wake-up syllable recognition results.
  • the extraction unit includes: an extraction subunit, a dimensionality reduction subunit, and an encoding subunit.
  • the extraction subunit is used to extract features of the speech to be recognized and obtain a feature matrix.
  • the dimensionality reduction subunit is used to reduce the dimensionality of the feature matrix and obtain the dimensionally reduced feature matrix.
  • the encoding subunit is used to perform multi-level speech enhancement encoding processing on the dimensionally reduced feature matrix to obtain the syllable feature matrix.
  • the voice wake-up device further includes: a second determination module.
  • the second determination module is configured to determine that the speech to be recognized is an incorrect wake-up speech when it is determined that the wake-up word recognition result is used to represent that the speech to be recognized does not include the predetermined wake-up word.
  • the predetermined parameter is determined based on a lightweight parameter amount threshold.
  • the voice wake-up device further includes: a display module and a feedback module.
  • the display module is used to display the target object on the display interface when it is determined that the voice to be recognized is the correct wake-up voice.
  • The feedback module is used to output feedback speech.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method according to the embodiments of the present disclosure.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform a method according to an embodiment of the present disclosure.
  • a computer program product includes a computer program, and when executed by a processor, the computer program implements a method according to an embodiment of the present disclosure.
  • FIG. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure.
  • An electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800.
  • Computing unit 801, ROM 802 and RAM 803 are connected to each other via bus 804.
  • An input/output (I/O) interface 805 is also connected to bus 804.
  • Multiple components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk or optical disc; and a communication unit 809, such as a network card, modem, or wireless communication transceiver.
  • the communication unit 809 allows the device 800 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • The computing unit 801 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the computing unit 801 performs various methods and processes described above, such as the voice wake-up method.
  • the voice wake-up method may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as storage unit 808.
  • part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809.
  • When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the voice wake-up method described above can be performed.
  • the computing unit 801 may be configured to perform the voice wake-up method in any other suitable manner (eg, by means of firmware).
  • Various implementations of the systems and techniques described above can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs are executable and/or interpretable on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic input, voice input, or tactile input.
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
  • Computer systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, a distributed system server, or a server combined with a blockchain.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electric Clocks (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice wake-up method and device, an electronic device, a storage medium, and a program product. The method includes: performing word recognition on speech to be recognized to obtain a wake-up word recognition result (S210); when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes a predetermined wake-up word, performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result (S220); and when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes predetermined syllables, determining that the speech to be recognized is a correct wake-up speech (S230).

Description

Voice wake-up method and device, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202210838284.6, filed on July 15, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to human-computer interaction, deep learning, intelligent speech, and other technical fields, and specifically to a voice wake-up method, device, electronic device, storage medium, and program product.
Background
Voice interaction is a natural mode of human interaction. With the continuous development of artificial intelligence technology, machines can now understand human speech, grasp its underlying meaning, and provide corresponding feedback. In these operations, the response speed of wake-up, the difficulty of waking up, accurate semantic understanding, and the promptness of feedback all affect the smoothness of voice interaction.
Summary
The present disclosure provides a voice wake-up method and device, an electronic device, a storage medium, and a program product.
According to one aspect of the present disclosure, a voice wake-up method is provided, including: performing word recognition on speech to be recognized to obtain a wake-up word recognition result; when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes a predetermined wake-up word, performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result; and when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes predetermined syllables, determining that the speech to be recognized is a correct wake-up speech.
According to another aspect of the present disclosure, a voice wake-up device is provided, including: a word recognition module for performing word recognition on speech to be recognized to obtain a wake-up word recognition result; a syllable recognition module for performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes a predetermined wake-up word; and a first determination module for determining that the speech to be recognized is a correct wake-up speech when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes predetermined syllables.
According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method of the present disclosure.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform the method of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method of the present disclosure.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.
Brief Description of the Drawings
The drawings are used for a better understanding of this solution and do not constitute a limitation of the present disclosure, in which:
Figure 1 schematically shows an exemplary system architecture to which the voice wake-up method and device can be applied according to an embodiment of the present disclosure;
Figure 2 schematically shows a flowchart of a voice wake-up method according to an embodiment of the present disclosure;
Figure 3 schematically shows a network structure diagram of a wake-up word recognition model according to an embodiment of the present disclosure;
Figure 4 schematically shows a network structure diagram of a wake-up syllable recognition model according to an embodiment of the present disclosure;
Figure 5 schematically shows a flowchart of a voice wake-up method according to another embodiment of the present disclosure;
Figure 6 schematically shows an application diagram of a voice wake-up method according to another embodiment of the present disclosure;
Figure 7 schematically shows a block diagram of a voice wake-up device according to an embodiment of the present disclosure; and
Figure 8 schematically shows a block diagram of an electronic device suitable for implementing the voice wake-up method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of the present disclosure to facilitate understanding; they should be considered merely exemplary. Therefore, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
The present disclosure provides a voice wake-up method and device, an electronic device, a storage medium, and a program product.
According to one aspect of the present disclosure, a voice wake-up method is provided, including: performing word recognition on speech to be recognized to obtain a wake-up word recognition result; when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes a predetermined wake-up word, performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result; and when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes predetermined syllables, determining that the speech to be recognized is a correct wake-up speech.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of the user's personal information involved all comply with the provisions of relevant laws and regulations, necessary confidentiality measures have been taken, and public order and good customs are not violated.
In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
Figure 1 schematically shows an application scenario diagram of the voice wake-up method and device according to an embodiment of the present disclosure.
It should be noted that Figure 1 is only an example of an application scenario to which embodiments of the present disclosure can be applied, to help those skilled in the art understand the technical content of the present disclosure, but it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.
As shown in Figure 1, the user 102 can send speech to be recognized to the voice interaction device 101, and the voice interaction device 101 can determine whether the speech to be recognized is a correct wake-up speech. When the voice interaction device 101 determines that the speech to be recognized is a correct wake-up speech, it collects the user's command speech containing intention information and executes the intended operation in the command speech, realizing human-computer interaction between the user 102 and the voice interaction device 101.
Various communication client applications can be installed on the voice interaction device 101, such as knowledge-reading applications, web browser applications, search applications, instant messaging tools, email clients, and/or social platform software (examples only).
The voice interaction device 101 may include a sound collector, such as a microphone, to collect the user 102's speech to be recognized and command speech containing intention information. The voice interaction device 101 may also include a sound player, such as a speaker, to play the sounds emitted by the voice interaction device.
The voice interaction device 101 may be any electronic device capable of interacting through voice signals. The voice interaction device 101 may include, but is not limited to, a smartphone, a tablet computer, a laptop computer, a smart home appliance, a smart speaker, an in-vehicle speaker, a smart tutoring machine, a smart robot, and so on.
It should be noted that the syllable recognition model and the keyword recognition model provided by the embodiments of the present disclosure are loaded in the voice interaction device 101, and the voice processing method can generally be executed by the voice interaction device 101. Correspondingly, the voice processing device provided by the embodiments of the present disclosure can also be provided in the voice interaction device 101. The terminal device can implement the voice wake-up method and device provided by the embodiments of the present disclosure without interacting with a server.
However, this is not limiting. In another embodiment of the present disclosure, the voice interaction device can also transmit the speech to be recognized to a server through a network and use the server to process the speech to be recognized to determine whether it is a correct wake-up speech.
It should be noted that the sequence number of each operation in the following methods is merely a representation of that operation for ease of description and should not be regarded as indicating an execution order. Unless explicitly stated, the methods need not be performed exactly in the order shown.
Figure 2 schematically shows a flowchart of a voice wake-up method according to an embodiment of the present disclosure.
As shown in Figure 2, the method includes operations S210 to S230.
In operation S210, word recognition is performed on the speech to be recognized to obtain a wake-up word recognition result.
In operation S220, when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes a predetermined wake-up word, syllable recognition is performed on the speech to be recognized to obtain a wake-up syllable recognition result.
In operation S230, when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes predetermined syllables, the speech to be recognized is determined to be a correct wake-up speech.
According to embodiments of the present disclosure, the speech to be recognized may be wake-up speech. Wake-up speech may refer to a voice signal received before the voice interaction function is awakened, such as speech that includes a wake-up word or speech that includes non-wake-up words.
According to embodiments of the present disclosure, a correct wake-up speech may refer to speech that includes the wake-up word, or speech that can awaken the voice interaction function. When it is determined that the speech to be recognized is a correct wake-up speech, the voice interaction function of the voice interaction device can be triggered. When it is determined that the speech to be recognized is an incorrect wake-up speech, the operation can be stopped and no response made to the user.
According to embodiments of the present disclosure, the voice interaction function may refer to a function that can receive interactive speech from a user and output to the user a voice feedback result corresponding to the interactive speech.
According to embodiments of the present disclosure, performing word recognition on the speech to be recognized may refer to recognizing the wake-up word in the speech to be recognized: word recognition is performed on the speech to be recognized to obtain a wake-up word recognition result. The wake-up word recognition result can indicate whether the speech to be recognized includes a predetermined wake-up word.
According to embodiments of the present disclosure, performing word recognition on the speech to be recognized means recognizing the speech from a global, holistic perspective to obtain a wake-up word recognition result. For example, if the predetermined wake-up word is "Xiao D" (小D) and the speech to be recognized is "Xiao D, hello", performing word recognition on the speech to be recognized yields a wake-up word recognition result indicating that the speech includes the predetermined wake-up word.
According to other embodiments of the present disclosure, whether the speech to be recognized is a correct wake-up speech may be determined based on the wake-up word recognition result. For example, when it is determined that the wake-up word recognition result indicates that the speech includes the predetermined wake-up word, the speech may be determined to be a correct wake-up speech, and the human-computer interaction function may be turned on. When it is determined that the wake-up word recognition result indicates that the speech does not include the predetermined wake-up word, the speech may be determined to be an incorrect wake-up speech, and no response is made.
According to embodiments of the present disclosure, when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes the predetermined wake-up word, syllable recognition can be performed on the speech to be recognized to obtain a wake-up syllable recognition result.
According to embodiments of the present disclosure, performing syllable recognition on the speech to be recognized may refer to recognizing, in the speech to be recognized, the syllables corresponding to the wake-up word to obtain a wake-up syllable recognition result. The wake-up syllable recognition result indicates whether the speech to be recognized includes predetermined syllables. The predetermined syllables may refer to the syllables corresponding to the predetermined wake-up word.
According to embodiments of the present disclosure, performing syllable recognition on the speech to be recognized means recognizing the speech locally, at the level of its character units. For example, the predetermined wake-up word "Xiao D" corresponds to two predetermined syllables, the syllable "Xiao" (小) and the syllable "D"; if the speech to be recognized is "Xiao D, hello", performing syllable recognition on it yields a recognition result indicating that the speech includes the predetermined syllables.
According to embodiments of the present disclosure, when it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes the predetermined syllables, the speech to be recognized is determined to be a correct wake-up speech. When it is determined that the wake-up syllable recognition result indicates that the speech to be recognized does not include the predetermined syllables, the speech to be recognized is determined to be an incorrect wake-up speech.
According to other embodiments of the present disclosure, word recognition may be skipped, and only syllable recognition performed on the speech to be recognized to obtain a wake-up syllable recognition result. Based on the wake-up syllable recognition result, whether the speech to be recognized is a correct wake-up speech is determined. For example, when it is determined that the wake-up syllable recognition result indicates that the speech includes the predetermined syllables, the speech is determined to be a correct wake-up speech, and the human-computer interaction function can be turned on. When it is determined that the wake-up syllable recognition result indicates that the speech does not include the predetermined syllables, the speech is determined to be an incorrect wake-up speech, and no response is made.
According to embodiments of the present disclosure, compared with determining whether the speech to be recognized is a correct wake-up speech based only on the wake-up word recognition result, or only on the wake-up syllable recognition result, the approach provided by embodiments of the present disclosure (performing syllable recognition to obtain a wake-up syllable recognition result when it is determined that the wake-up word recognition result indicates that the speech includes the predetermined wake-up word, and then deciding based on the wake-up syllable recognition result) uses the word recognition operation to identify the wake-up word at the whole-word level while the syllable recognition operation identifies it at the character level. The speech to be recognized is thus recognized both globally and locally, which ensures wake-up accuracy and avoids false wake-ups when the wake-up word has fewer than four characters, for example three or two.
According to another embodiment of the present disclosure, for operation S210 shown in Figure 2, performing word recognition on the speech to be recognized to obtain a wake-up word recognition result may further include: performing a convolution operation on the speech to be recognized to obtain a first-level feature vector sequence; performing a gated recurrent operation on the first-level feature vector sequence to obtain a second-level feature vector sequence; and classifying the second-level feature vector sequence to obtain the wake-up word recognition result.
According to embodiments of the present disclosure, the speech to be recognized may include a speech frame sequence. The first-level feature vector sequence corresponds one-to-one to the speech frame sequence.
According to embodiments of the present disclosure, a wake-up word recognition model can be used to perform word recognition on the speech to be recognized to obtain the wake-up word recognition result. However, this is not limiting: any other word recognition method that can obtain the wake-up word recognition result may also be used.
Figure 3 schematically shows a network structure diagram of a wake-up word recognition model according to an embodiment of the present disclosure.
As shown in Figure 3, the wake-up word recognition model includes, in sequence, a convolution module 310, a gated recurrent unit 320, and a wake-up word classification module 330.
As shown in Figure 3, the speech to be recognized 340 is input into the convolution module 310 to obtain a first-level feature vector sequence. The first-level feature vector sequence is input into the gated recurrent unit 320 to obtain a second-level feature vector sequence. The second-level feature vector sequence is input into the wake-up word classification module 330 to obtain the wake-up word recognition result 350.
According to embodiments of the present disclosure, the convolution module in the wake-up word recognition model is not limited to one and may include multiple stacked convolution modules. Similarly, the wake-up word recognition model may also include multiple stacked gated recurrent units.
According to embodiments of the present disclosure, the convolution module may include one or more combinations of a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, and the like.
According to embodiments of the present disclosure, the wake-up word classification module may include a fully connected layer and an activation function. The activation function may be a Softmax activation function, but is not limited to this and may also be a Sigmoid activation function. The number of fully connected layers is not limited; for example, it may be one layer or multiple layers.
According to embodiments of the present disclosure, the gated recurrent unit may be a GRU (Gate Recurrent Unit), but is not limited to this and may also be a GRU-derived module, for example, a GRU-derived module obtained by applying lightweight processing to the GRU.
According to embodiments of the present disclosure, using the GRU-derived module, also called the Projected Light-GRU module, facilitates loading the wake-up word recognition model on a terminal device such as a voice interaction device, i.e., lightweight deployment on the terminal side, thereby ensuring real-time word recognition of the speech to be recognized.
According to another embodiment of the present disclosure, performing the gated recurrent operation on the first-level feature vector sequence to obtain the second-level feature vector sequence includes repeatedly performing the following operations: based on the output vector at the previous moment and the input vector at the current moment, respectively determining the update gate at the current moment and the candidate hidden layer information at the current moment, where the input vector at the current moment is the first-level feature vector at the current moment in the first-level feature vector sequence; based on the candidate hidden layer information at the current moment, the hidden layer information at the previous moment, and the update gate at the current moment, determining the hidden layer information at the current moment; and based on the hidden layer information at the current moment and a predetermined parameter, determining the output vector at the current moment, where the output vector at the current moment is the second-level feature vector at the current moment in the second-level feature vector sequence.
According to embodiments of the present disclosure, the predetermined parameter, also called the mapping (Projection) parameter, is determined based on a lightweight parameter amount threshold.
According to embodiments of the present disclosure, the lightweight parameter amount threshold may refer to a preset benchmark for parameter size, such as a specified parameter amount threshold; the size of the predetermined parameter is less than or equal to the lightweight parameter amount threshold, so as to reduce the data processing load of the wake-up word recognition model.
According to embodiments of the present disclosure, the Projected Light-GRU module can be expressed by the following formulas (1) to (4), where formulas (2) and (3) follow the standard Light-GRU form implied by the step description and the symbol definitions below:

$z_t = \sigma(\mathrm{BN}(w_z x_t) + u_z o_{t-1})$  (1)
$\tilde{h}_t = g(\mathrm{BN}(w_h x_t) + u_h o_{t-1})$  (2)
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$  (3)
$o_t = w_o h_t$  (4)

Here $z_t$ represents the update gate at time t, with range (0, 1); $\sigma(\cdot)$ represents the sigmoid function; $g(\cdot)$ represents the Gaussian error linear unit activation function (e.g., the GELU activation function); $\mathrm{BN}(\cdot)$ represents the normalization function; $x_t$ represents the input vector at time t; $o_{t-1}$ represents the output vector at time t-1; $o_t$ represents the output data at time t; $w_z$ and $u_z$ represent the parameters related to the sigmoid function; $w_h$ and $u_h$ represent the parameters related to the GELU activation function; $h_{t-1}$ represents the hidden layer information at time t-1; $h_t$ represents the hidden layer information at time t; $w_o$ represents the mapping parameter; and $\tilde{h}_t$ represents the candidate hidden layer information at time t.
According to embodiments of the present disclosure, compared with a standard GRU, the Projected Light-GRU module provided by embodiments of the present disclosure removes the reset gate and introduces the predetermined parameter, keeping the computational load of the wake-up word recognition model small. Applying the wake-up word recognition model with the Projected Light-GRU module to a voice interaction device ensures high performance while reducing resource overhead, so that the wake-up word recognition model loaded in the voice interaction device can run around the clock, improving the wake-up response speed of the voice interaction device.
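A minimal usage sketch, assuming the hypothetical ProjectedLightGRUCell shown earlier in this document and zero initial states, unrolls the cell over the first-level feature vector sequence to produce the second-level feature vector sequence:

```python
import torch

def run_gated_recurrent(cell, first_level, hidden_size=128, proj_size=32):
    """Unroll a Projected Light-GRU cell over (batch, time, input_size)
    first-level features; sizes and zero initial states are assumptions."""
    batch, time, _ = first_level.shape
    o_prev = first_level.new_zeros(batch, proj_size)    # output vector o_0
    h_prev = first_level.new_zeros(batch, hidden_size)  # hidden state h_0
    second_level = []
    for t in range(time):  # one step per speech frame
        o_prev, h_prev = cell(first_level[:, t], o_prev, h_prev)
        second_level.append(o_prev)
    return torch.stack(second_level, dim=1)  # (batch, time, proj_size)
```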
According to another embodiment of the present disclosure, for operation S220 shown in Figure 2, when it is determined that the wake-up word recognition result indicates that the speech to be recognized includes the predetermined wake-up word, performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result may further include: performing syllable feature extraction on the speech to be recognized to obtain a syllable feature matrix; and classifying the syllable feature matrix to obtain the wake-up syllable recognition result.
According to embodiments of the present disclosure, a syllable recognition model can be used to perform syllable recognition on the speech to be recognized to obtain the wake-up syllable recognition result. However, this is not limiting: any other syllable recognition method that can obtain the wake-up syllable recognition result may also be used.
Figure 4 schematically shows a network structure diagram of a wake-up syllable recognition model according to an embodiment of the present disclosure.
As shown in Figure 4, the wake-up syllable recognition model includes, in sequence, a feature extraction coding module 410 and a syllable classification module 420.
As shown in Figure 4, the speech to be recognized 430 is input into the feature extraction coding module 410 to perform syllable feature extraction and output a syllable feature matrix. The syllable feature matrix is input into the syllable classification module 420 to perform the classification operation and output the wake-up syllable recognition result 440.
According to embodiments of the present disclosure, the syllable classification module may include a fully connected layer and an activation function. The activation function may be a Softmax activation function, but is not limited to this and may also be a Sigmoid activation function. The number of fully connected layers is not limited; for example, it may be one layer or multiple layers.
According to embodiments of the present disclosure, the feature extraction coding module can be built using the network structure of the Conformer model (a convolution-augmented encoder), but is not limited to this: the Conformer block in the Conformer model can also be used, or a network structure obtained after lightweight processing, such as pruning, of the Conformer model or Conformer block.
According to embodiments of the present disclosure, performing syllable feature extraction on the speech to be recognized to obtain a syllable feature matrix may further include: performing feature extraction on the speech to be recognized to obtain a feature matrix; reducing the dimensionality of the feature matrix to obtain a reduced feature matrix; and applying multi-level speech-enhancement encoding to the reduced feature matrix to obtain the syllable feature matrix.
According to embodiments of the present disclosure, the feature extraction coding module may include, in sequence, a feature extraction layer, a dimensionality reduction layer, and a coding layer. The feature extraction layer can be used to perform feature extraction on the speech to be recognized to obtain a feature matrix. The dimensionality reduction layer is used to reduce the dimensionality of the feature matrix to obtain a reduced feature matrix. The coding layer is used to apply multi-level speech-enhancement encoding to the reduced feature matrix to obtain the syllable feature matrix.
According to embodiments of the present disclosure, the feature extraction layer may include at least one of the following: at least one relative sinusoidal positional encoding layer, at least one convolutional layer, and at least one feed-forward layer (Feed Forward Module).
According to embodiments of the present disclosure, the coding layer may include a Conformer block, for example including at least one of the following: multiple feed-forward layers, at least one multi-headed self-attention layer (Multi-Headed Self-Attention module), and at least one convolutional layer.
According to embodiments of the present disclosure, the dimensionality reduction layer may include a mapping function, but is not limited to this, and may also include other implementations, such as a layer structure that reduces a high-dimensional matrix to a low-dimensional matrix.
According to embodiments of the present disclosure, the dimensionality reduction layer can reduce the amount of data input into the coding layer, thereby reducing the computation of the syllable recognition model. In addition, the number of stacked coding layers can be reduced; for example, according to the lightweight parameter amount threshold, the number of stacked coding layers can be set to any value from 1 to 4.
According to embodiments of the present disclosure, by designing the dimensionality reduction layer in the wake-up syllable recognition model and controlling the number of stacked coding layers, the wake-up syllable recognition model can be made lightweight and compact while maintaining recognition accuracy, thereby improving recognition efficiency; when the wake-up syllable recognition model is applied to a terminal device, the internal load of the terminal device's processor can be reduced.
Figure 5 schematically shows a flowchart of a voice wake-up method according to another embodiment of the present disclosure.
As shown in Figure 5, the speech to be recognized 510 is input into the wake-up word recognition model 520 to obtain the wake-up word recognition result 530. When it is determined that the wake-up word recognition result 530 indicates that the speech to be recognized 510 includes the predetermined wake-up word, the speech to be recognized 510 is input into the wake-up syllable recognition model 540 to obtain the wake-up syllable recognition result 550. When it is determined that the wake-up syllable recognition result 550 indicates that the speech to be recognized includes the predetermined syllables, the speech to be recognized is determined to be a correct wake-up speech; the voice interaction device is awakened and subsequent human-computer interaction can proceed. When it is determined that the wake-up word recognition result indicates that the speech to be recognized does not include the predetermined wake-up word, the speech to be recognized is determined to be an incorrect wake-up speech, and the operation stops. When it is determined that the wake-up syllable recognition result indicates that the speech to be recognized does not include the predetermined syllables, the speech to be recognized is determined to be an incorrect wake-up speech, and the voice interaction device is not awakened.
According to other embodiments of the present disclosure, the speech to be recognized may instead first be input into the wake-up syllable recognition model to obtain the wake-up syllable recognition result. When it is determined that the wake-up syllable recognition result indicates that the speech to be recognized includes the predetermined syllables, the speech to be recognized is input into the wake-up word recognition model to obtain the wake-up word recognition result. When it is determined that the wake-up word recognition result indicates that the speech to be recognized includes the predetermined wake-up word, the speech to be recognized is determined to be a correct wake-up speech; the voice interaction device is awakened and subsequent human-computer interaction can proceed. When it is determined that the wake-up syllable recognition result indicates that the speech to be recognized does not include the predetermined syllables, the speech to be recognized is determined to be an incorrect wake-up speech, and the operation stops. When it is determined that the wake-up word recognition result indicates that the speech to be recognized does not include the predetermined wake-up word, the speech to be recognized is determined to be an incorrect wake-up speech, and the voice interaction device is not awakened.
According to still other embodiments of the present disclosure, the speech to be recognized can also be input into the wake-up word recognition model to obtain the wake-up word recognition result, and input into the wake-up syllable recognition model to obtain the wake-up syllable recognition result. When it is determined that the wake-up word recognition result indicates that the speech to be recognized includes the predetermined wake-up word and the syllable recognition result indicates that the speech to be recognized includes the predetermined syllables, the speech to be recognized is determined to be a correct wake-up speech. When it is determined that the wake-up word recognition result indicates that the speech to be recognized does not include the predetermined wake-up word, or the syllable recognition result indicates that the speech to be recognized does not include the predetermined syllables, the speech to be recognized is determined to be an incorrect wake-up speech.
According to embodiments of the present disclosure, processing the speech to be recognized with the above wake-up word recognition model and wake-up syllable recognition model can be applied in scenarios where the number of characters in the wake-up word is reduced: when the wake-up word is one, two, or three characters, recognition accuracy is maintained while the false alarm rate is reduced.
According to embodiments of the present disclosure, compared with first using the wake-up syllable model to perform syllable recognition on the speech to be recognized, or with simultaneously using the wake-up syllable model for syllable recognition and the wake-up word recognition model for word recognition, the approach of first using the wake-up word recognition model to perform word recognition exploits the wake-up word recognition model's simple network structure and small computational load, so that with the terminal device in a real-time activated state, the internal load of the voice interaction device serving as the terminal device is reduced while recognition accuracy is maintained.
Figure 6 schematically shows an application diagram of a voice wake-up method according to another embodiment of the present disclosure.
As shown in Figure 6, the user 610 sends speech to be recognized to the voice interaction device 620. The voice interaction device uses the wake-up word recognition model and the wake-up syllable recognition model loaded in the voice interaction device 620 to perform the operations of the voice wake-up method on the speech to be recognized and determine whether it is a correct wake-up speech. When it is determined that the speech to be recognized is a correct wake-up speech, the target object 630 is displayed on the display interface 621 of the voice interaction device 620, and at the same time the voice interaction device 620 outputs feedback speech, reflecting human-computer interaction flexibly and vividly.
FIG. 7 schematically shows a block diagram of a voice wake-up apparatus according to an embodiment of the present disclosure.
As shown in FIG. 7, the voice wake-up apparatus 700 includes: a word recognition module 710, a syllable recognition module 720, and a first determination module 730.
The word recognition module 710 is configured to perform word recognition on speech to be recognized to obtain a wake-up word recognition result.
The syllable recognition module 720 is configured to perform syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result in the case where it is determined that the wake-up word recognition result represents that the speech to be recognized includes a predetermined wake-up word.
The first determination module 730 is configured to determine that the speech to be recognized is a correct wake-up speech in the case where it is determined that the wake-up syllable recognition result represents that the speech to be recognized includes a predetermined syllable.
According to an embodiment of the present disclosure, the word recognition module includes: a convolution unit, a gating unit, and a word classification unit.
The convolution unit is configured to perform a convolution operation on the speech to be recognized to obtain a first-level feature vector sequence, where the speech to be recognized includes a speech frame sequence and the first-level feature vector sequence corresponds one-to-one with the speech frame sequence.
The gating unit is configured to perform a gated recurrent operation on the first-level feature vector sequence to obtain a second-level feature vector sequence.
The word classification unit is configured to perform a classification operation on the second-level feature vector sequence to obtain the wake-up word recognition result.
According to an embodiment of the present disclosure, the gating unit includes the following repeated subunits:
a first determination subunit, configured to determine, based on the output vector at the previous time and the input vector at the current time, the update gate at the current time and the candidate hidden-layer information at the current time, respectively, where the input vector at the current time is the first-level feature vector at the current time in the first-level feature vector sequence;
a second determination subunit, configured to determine the hidden-layer information at the current time based on the candidate hidden-layer information at the current time, the hidden-layer information at the previous time, and the update gate at the current time; and
a third determination subunit, configured to determine the output vector at the current time based on the hidden-layer information at the current time and a predetermined parameter, where the output vector at the current time is the second-level feature vector at the current time in the second-level feature vector sequence.
According to an embodiment of the present disclosure, the syllable recognition module includes: an extraction unit and a syllable classification unit.
The extraction unit is configured to perform syllable feature extraction on the speech to be recognized to obtain a syllable feature matrix.
The syllable classification unit is configured to perform a classification operation on the syllable feature matrix to obtain the wake-up syllable recognition result.
According to an embodiment of the present disclosure, the extraction unit includes: an extraction subunit, a dimension-reduction subunit, and an encoding subunit.
The extraction subunit is configured to perform feature extraction on the speech to be recognized to obtain a feature matrix.
The dimension-reduction subunit is configured to perform dimension reduction on the feature matrix to obtain a dimension-reduced feature matrix.
The encoding subunit is configured to perform multi-level speech-enhancing encoding on the dimension-reduced feature matrix to obtain the syllable feature matrix.
According to an embodiment of the present disclosure, the voice wake-up apparatus further includes: a second determination module.
The second determination module is configured to determine that the speech to be recognized is a false wake-up speech in the case where it is determined that the wake-up word recognition result represents that the speech to be recognized does not include the predetermined wake-up word.
According to an embodiment of the present disclosure, the predetermined parameter is determined based on a lightweight parameter-quantity threshold.
According to an embodiment of the present disclosure, the voice wake-up apparatus further includes: a display module and a feedback module.
The display module is configured to display a target object on a display interface in the case where it is determined that the speech to be recognized is a correct wake-up speech.
The feedback module is configured to output feedback speech.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to the embodiments of the present disclosure.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions, where the computer instructions are configured to cause a computer to perform the method according to the embodiments of the present disclosure.
According to an embodiment of the present disclosure, a computer program product includes a computer program that, when executed by a processor, implements the method according to the embodiments of the present disclosure.
FIG. 8 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 8, the device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 may also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Multiple components of the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk or an optical disc; and a communication unit 809, such as a network card, a modem, or a wireless communication transceiver. The communication unit 809 allows the device 800 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 801 performs the various methods and processes described above, such as the voice wake-up method. For example, in some embodiments, the voice wake-up method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the voice wake-up method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured in any other appropriate manner (for example, by means of firmware) to perform the voice wake-up method.
Various implementations of the systems and techniques described herein above may be realized in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmitting data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that, when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a standalone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described here may be implemented on a computer having: a display apparatus for displaying information to the user (for example, a CRT (cathode-ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (for example, a mouse or trackball) by which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, voice input, or tactile input.
The systems and techniques described here may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include: a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (19)

  1. A voice wake-up method, comprising:
    performing word recognition on speech to be recognized to obtain a wake-up word recognition result;
    performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result in a case where it is determined that the wake-up word recognition result represents that the speech to be recognized includes a predetermined wake-up word; and
    determining that the speech to be recognized is a correct wake-up speech in a case where it is determined that the wake-up syllable recognition result represents that the speech to be recognized includes a predetermined syllable.
  2. The method according to claim 1, wherein the performing word recognition on the speech to be recognized to obtain a wake-up word recognition result comprises:
    performing a convolution operation on the speech to be recognized to obtain a first-level feature vector sequence, wherein the speech to be recognized comprises a speech frame sequence, and the first-level feature vector sequence corresponds one-to-one with the speech frame sequence;
    performing a gated recurrent operation on the first-level feature vector sequence to obtain a second-level feature vector sequence; and
    performing a classification operation on the second-level feature vector sequence to obtain the wake-up word recognition result.
  3. The method according to claim 2, wherein the performing a gated recurrent operation on the first-level feature vector sequence to obtain a second-level feature vector sequence comprises repeatedly performing the following operations:
    determining, based on an output vector at a previous time and an input vector at a current time, an update gate at the current time and candidate hidden-layer information at the current time, respectively, wherein the input vector at the current time is a first-level feature vector at the current time in the first-level feature vector sequence;
    determining hidden-layer information at the current time based on the candidate hidden-layer information at the current time, hidden-layer information at the previous time, and the update gate at the current time; and
    determining an output vector at the current time based on the hidden-layer information at the current time and a predetermined parameter, wherein the output vector at the current time is a second-level feature vector at the current time in the second-level feature vector sequence.
  4. The method according to claim 1, wherein the performing syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result in the case where it is determined that the wake-up word recognition result represents that the speech to be recognized includes the predetermined wake-up word comprises:
    performing syllable feature extraction on the speech to be recognized to obtain a syllable feature matrix; and
    performing a classification operation on the syllable feature matrix to obtain the wake-up syllable recognition result.
  5. The method according to claim 4, wherein the performing syllable feature extraction on the speech to be recognized to obtain a syllable feature matrix comprises:
    performing feature extraction on the speech to be recognized to obtain a feature matrix;
    performing dimension reduction on the feature matrix to obtain a dimension-reduced feature matrix; and
    performing multi-level speech-enhancing encoding on the dimension-reduced feature matrix to obtain the syllable feature matrix.
  6. The method according to any one of claims 1 to 5, further comprising:
    determining that the speech to be recognized is a false wake-up speech in a case where it is determined that the wake-up word recognition result represents that the speech to be recognized does not include the predetermined wake-up word.
  7. The method according to claim 3, wherein the predetermined parameter is determined based on a lightweight parameter-quantity threshold.
  8. The method according to claim 1, further comprising:
    displaying a target object on a display interface in a case where it is determined that the speech to be recognized is a correct wake-up speech; and
    outputting feedback speech.
  9. A voice wake-up apparatus, comprising:
    a word recognition module configured to perform word recognition on speech to be recognized to obtain a wake-up word recognition result;
    a syllable recognition module configured to perform syllable recognition on the speech to be recognized to obtain a wake-up syllable recognition result in a case where it is determined that the wake-up word recognition result represents that the speech to be recognized includes a predetermined wake-up word; and
    a first determination module configured to determine that the speech to be recognized is a correct wake-up speech in a case where it is determined that the wake-up syllable recognition result represents that the speech to be recognized includes a predetermined syllable.
  10. The apparatus according to claim 9, wherein the word recognition module comprises:
    a convolution unit configured to perform a convolution operation on the speech to be recognized to obtain a first-level feature vector sequence, wherein the speech to be recognized comprises a speech frame sequence, and the first-level feature vector sequence corresponds one-to-one with the speech frame sequence;
    a gating unit configured to perform a gated recurrent operation on the first-level feature vector sequence to obtain a second-level feature vector sequence; and
    a word classification unit configured to perform a classification operation on the second-level feature vector sequence to obtain the wake-up word recognition result.
  11. The apparatus according to claim 10, wherein the gating unit comprises the following repeated subunits:
    a first determination subunit configured to determine, based on an output vector at a previous time and an input vector at a current time, an update gate at the current time and candidate hidden-layer information at the current time, respectively, wherein the input vector at the current time is a first-level feature vector at the current time in the first-level feature vector sequence;
    a second determination subunit configured to determine hidden-layer information at the current time based on the candidate hidden-layer information at the current time, hidden-layer information at the previous time, and the update gate at the current time; and
    a third determination subunit configured to determine an output vector at the current time based on the hidden-layer information at the current time and a predetermined parameter, wherein the output vector at the current time is a second-level feature vector at the current time in the second-level feature vector sequence.
  12. The apparatus according to claim 9, wherein the syllable recognition module comprises:
    an extraction unit configured to perform syllable feature extraction on the speech to be recognized to obtain a syllable feature matrix; and
    a syllable classification unit configured to perform a classification operation on the syllable feature matrix to obtain the wake-up syllable recognition result.
  13. The apparatus according to claim 12, wherein the extraction unit comprises:
    an extraction subunit configured to perform feature extraction on the speech to be recognized to obtain a feature matrix;
    a dimension-reduction subunit configured to perform dimension reduction on the feature matrix to obtain a dimension-reduced feature matrix; and
    an encoding subunit configured to perform multi-level speech-enhancing encoding on the dimension-reduced feature matrix to obtain the syllable feature matrix.
  14. The apparatus according to any one of claims 9 to 13, further comprising:
    a second determination module configured to determine that the speech to be recognized is a false wake-up speech in a case where it is determined that the wake-up word recognition result represents that the speech to be recognized does not include the predetermined wake-up word.
  15. The apparatus according to claim 11, wherein the predetermined parameter is determined based on a lightweight parameter-quantity threshold.
  16. The apparatus according to claim 9, further comprising:
    a display module configured to display a target object on a display interface in a case where it is determined that the speech to be recognized is a correct wake-up speech; and
    a feedback module configured to output feedback speech.
  17. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 8.
  18. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method according to any one of claims 1 to 8.
  19. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 8.
PCT/CN2023/072618 2022-07-15 2023-01-17 Voice wake-up method and apparatus, electronic device, and storage medium WO2024011885A1 (zh)

Applications Claiming Priority (2)

Application Number: CN202210838284.6; Priority Date: 2022-07-15
Application Number: CN202210838284.6A (CN115223573A); Priority Date: 2022-07-15; Filing Date: 2022-07-15; Title: Voice wake-up method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number: WO2024011885A1
