WO2022127620A1 - Voice wake-up method, apparatus, electronic device, and storage medium - Google Patents

Voice wake-up method, apparatus, electronic device, and storage medium

Info

Publication number
WO2022127620A1
Authority
WO
WIPO (PCT)
Prior art keywords
wake
model
sample data
data
adaptive
Prior art date
Application number
PCT/CN2021/135387
Other languages
English (en)
French (fr)
Inventor
田垚
姚海涛
蔡猛
Original Assignee
北京有竹居网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022127620A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training

Definitions

  • The present disclosure relates to the technical field of data processing, for example, to a voice wake-up method, an apparatus, an electronic device, and a storage medium.
  • Voice wake-up is one application of keyword detection: a user speaks a wake-up word to start the voice interaction process.
  • The Recurrent Neural Network Transducer (RNN-T) model is used to achieve real-time detection.
  • The RNN-T model consists of three parts: an encoder network (Encoder), a prediction network (Prediction Network), and a joint network (Joint Network).
  • The model is usually trained by directly mixing wake-up word training data with speech recognition training data, and the trained model is then used for voice wake-up.
  • The role of the prediction network in RNN-T is similar to that of the language model in speech recognition: the next word is predicted from the previous word. Because the training data contains many wake-up word utterances whose corresponding text is identical, the prediction network overfits, which causes many false wake-ups when the model is used for voice wake-up and reduces wake-up accuracy.
  • the present disclosure provides a voice wake-up method, device, electronic device and storage medium, so as to realize voice wake-up through keyword detection.
  • a voice wake-up method including:
  • the seed wake-up model is obtained by training the initial wake-up model based on the first training sample data, wherein the first training sample data includes speech recognition data;
  • the trained adaptive wake-up model is used to perform keyword detection on the data to be tested to achieve voice wake-up.
  • a voice wake-up device comprising:
  • the seed wake-up model obtaining module is configured to train the initial wake-up model based on the first training sample data to obtain the seed wake-up model, wherein the first training sample data includes speech recognition data;
  • the adaptive wake-up model training module is set to initialize the seed wake-up model to obtain the adaptive wake-up model, and train the adaptive wake-up model based on the second training sample data, wherein the second training sample data includes speech recognition data and wake-up word data;
  • the voice wake-up module is set to use the trained adaptive wake-up model to perform keyword detection on the data to be tested, so as to realize voice wake-up.
  • An electronic device comprising:
  • one or more processors;
  • one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any embodiment of the present disclosure.
  • FIG. 1A is a flowchart of a voice wake-up method provided in Embodiment 1 of the present disclosure
  • FIG. 1B is a schematic structural diagram of an adaptive wake-up model provided by Embodiment 1 of the present disclosure
  • FIG. 2 is a flowchart of a voice wake-up method provided in Embodiment 2 of the present disclosure
  • FIG. 3 is a schematic structural diagram of a voice wake-up device according to Embodiment 3 of the present disclosure.
  • FIG. 4 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present disclosure.
  • method embodiments of the present disclosure may be performed in different orders, and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1A is a flowchart of a voice wake-up method provided by an embodiment of the present disclosure. This embodiment is applicable to an end-to-end voice wake-up situation.
  • the method can be executed by the voice wake-up device provided by the embodiment of the present disclosure, and the device can use software and/or hardware, and can be integrated in computer equipment.
  • The method in the embodiment of the present disclosure may include the following steps.
  • Step S101 training an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data includes speech recognition data.
  • The wake-up model includes a Recurrent Neural Network Transducer (RNN-T) model.
  • the RNN-T model includes an encoder network, a prediction network, and a joint network.
  • the encoder network can be represented by the symbol Encoder
  • the prediction network can be represented by the symbol Prediction Network
  • the joint network can be represented by the symbol Joint Network
  • the joint network is connected with the encoder network and the prediction network, respectively.
  • the input of the encoder network is the acoustic feature
  • the input of the prediction network is the last predicted symbol (text information)
  • the output of the entire RNN-T model is the probability distribution of the current symbol.
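As an illustration of the three-part structure just described, the following is a minimal numpy sketch of a single RNN-T decoding step. All dimensions, random weights, and the additive tanh joint are illustrative assumptions for exposition, not the patent's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, EMB_DIM, HID_DIM, VOCAB = 8, 4, 6, 10  # illustrative sizes

# Encoder network: maps an acoustic feature frame to a hidden vector.
W_enc = rng.normal(size=(FEAT_DIM, HID_DIM))

# Prediction network: maps the previously predicted symbol to a hidden
# vector, playing a role similar to a language model.
emb = rng.normal(size=(VOCAB, EMB_DIM))
W_pred = rng.normal(size=(EMB_DIM, HID_DIM))

# Joint network: combines both hidden vectors into a probability
# distribution over the current symbol.
W_joint = rng.normal(size=(HID_DIM, VOCAB))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnnt_step(acoustic_frame, prev_symbol):
    h_enc = np.tanh(acoustic_frame @ W_enc)            # encoder output
    h_pred = np.tanh(emb[prev_symbol] @ W_pred)        # prediction-network output
    return softmax(np.tanh(h_enc + h_pred) @ W_joint)  # P(current symbol)

probs = rnnt_step(rng.normal(size=FEAT_DIM), prev_symbol=3)
print(probs.shape)  # (10,)
```

The inputs mirror the description above: the encoder consumes the acoustic feature, the prediction network consumes the last predicted symbol, and the joint network emits the probability distribution of the current symbol.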
  • Training the initial wake-up model based on the first training sample data to obtain the seed wake-up model may include: obtaining first initial sample data; expanding the first initial sample data to obtain the first training sample data; and training the initial RNN-T model based on the first training sample data to obtain the seed RNN-T model.
  • First initial sample data is obtained, where the first initial sample data includes a small amount of speech recognition data; speech recognition data refers to speech of any content together with the text corresponding to that speech.
  • The data can be expanded by means such as room impulse response (reverberation), speed perturbation, and noise addition to increase its diversity, yielding first training sample data that is richer than the initial data.
  • The expanded first training sample data is still speech recognition data in form, but its types are more diverse.
  • The initial RNN-T model is trained on the first training sample data obtained after expansion to obtain the seed RNN-T model, whose network parameters are better optimized than those of the initial RNN-T model.
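The expansion means mentioned above (noise addition, speed change, and room-impulse-response reverberation) might be sketched as follows. The waveform, impulse response, and parameter values are toy assumptions; real pipelines would use measured impulse responses and proper resampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(wave, snr_db):
    """Mix in white noise at a given signal-to-noise ratio (dB)."""
    sig_pow = np.mean(wave ** 2)
    noise = rng.normal(size=wave.shape)
    noise *= np.sqrt(sig_pow / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return wave + noise

def change_speed(wave, rate):
    """Naive speed perturbation by linear resampling."""
    idx = np.arange(0, len(wave), rate)
    return np.interp(idx, np.arange(len(wave)), wave)

def apply_rir(wave, rir):
    """Simulate room reverberation by convolving with an impulse response."""
    return np.convolve(wave, rir)[: len(wave)]

clean = np.sin(np.linspace(0, 20 * np.pi, 1600))  # toy "utterance"
rir = np.array([1.0, 0.0, 0.4, 0.0, 0.1])         # toy room impulse response

augmented = [add_noise(clean, snr_db=20),
             change_speed(clean, rate=1.1),
             apply_rir(clean, rir)]
print([len(a) for a in augmented])
```

Each augmented copy keeps the original transcript, so one labeled utterance yields several acoustically diverse training samples.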
  • Step S102 Initialize the seed wake-up model to obtain an adaptive wake-up model, and train the adaptive wake-up model based on second training sample data, wherein the second training sample data includes speech recognition data and wake-up word data.
  • Initializing the seed wake-up model to obtain the adaptive wake-up model may include: adding a feed-forward neural network (FFNN) on the basis of the seed RNN-T model, where the FFNN is connected to the encoder network; taking the seed RNN-T model as the first branch and the FFNN together with the encoder network as the second branch; and obtaining the adaptive wake-up model from the first branch and the second branch.
  • FIG. 1B shows a schematic structural diagram of the adaptive wake-up model in this embodiment: an FFNN is added on the basis of the seed RNN-T model, and the seed RNN-T model on the left of the figure serves as the first branch.
  • The FFNN and the encoder network on the right serve as the second branch; the first branch and the second branch thus share a common structural part, namely the encoder network.
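The shared-encoder, two-branch layout can be sketched like this. The weights and sizes are illustrative, and the first branch is reduced to a single projection head standing in for the full seed RNN-T:

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT_DIM, HID_DIM, VOCAB = 8, 6, 10  # illustrative sizes

W_enc = rng.normal(size=(FEAT_DIM, HID_DIM))  # encoder shared by both branches
W_rnnt = rng.normal(size=(HID_DIM, VOCAB))    # stands in for the seed RNN-T head
W_ffnn = rng.normal(size=(HID_DIM, VOCAB))    # FFNN added for the second branch

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_model(frame):
    h = np.tanh(frame @ W_enc)       # common structural part: the encoder
    p_branch1 = softmax(h @ W_rnnt)  # first branch: seed RNN-T (simplified)
    p_branch2 = softmax(h @ W_ffnn)  # second branch: encoder -> FFNN
    return p_branch1, p_branch2

p1, p2 = adaptive_model(rng.normal(size=FEAT_DIM))
```

Because the encoder is shared, gradient updates from either branch refine the same acoustic representation, while each branch keeps its own output head.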
  • Training the adaptive wake-up model based on the second training sample data includes: obtaining second initial sample data; expanding the second initial sample data to obtain the second training sample data; and training the adaptive wake-up model based on the second training sample data.
  • Second initial sample data is obtained, which includes a small amount of speech recognition data and wake-up word data. Although the first initial sample data also includes speech recognition data, only the data form is the same; the content of the speech recognition data differs. Wake-up word data refers to the speech and text corresponding to the keywords.
  • The second initial sample data is expanded in roughly the same way as the first initial sample data.
  • Means such as room impulse response, speed perturbation, and noise addition can be used to increase the diversity of the second initial sample data.
  • Second training sample data with richer content is thereby obtained.
  • After expansion, the second training sample data still consists of speech recognition data and wake-up word data in form, but the data types are more diverse.
  • Training the adaptive wake-up model based on the second training sample data includes: training the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result; training the second branch based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result; and determining the weighted sum of the first loss function result and the second loss function result. When the weighted sum of the loss functions is less than a preset loss threshold, training of the adaptive wake-up model is determined to be complete.
  • The first branch and the second branch of the adaptive wake-up model correspond to different loss functions and use different data during training: the first branch uses only speech recognition data, while the second branch uses both speech recognition data and wake-up word data.
  • Because the prediction network never sees the wake-up word data when the adaptive wake-up model is trained, the overfitting of the prediction network caused by all wake-up words sharing the same text is avoided.
  • The following formula (1) can be used as the loss function of the entire model:

    L_MT = α · L_RNN-T + β · L_CTC  (1)

  • L_MT represents the overall loss function of the adaptive wake-up model;
  • L_RNN-T represents the loss function corresponding to the first branch;
  • L_CTC represents the loss function corresponding to the second branch;
  • α represents the weight coefficient corresponding to the first branch;
  • β represents the weight coefficient corresponding to the second branch.
  • The first branch is trained based on the speech recognition data to obtain the first loss function result, and the second branch is trained based on the speech recognition data and wake-up word data to obtain the second loss function result.
  • With the weight coefficient α for the first branch and β for the second branch, the weighted sum of the two loss function results gives the loss function of the entire model.
  • This weighted sum is compared with the preset loss threshold; if it is less than the preset loss threshold, training of the adaptive wake-up model is determined to be complete.
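The weighted combination and the stopping check can be written down directly. The weight value, the convention that the two weights sum to 1, the loss threshold, and the example loss values below are all made-up hyperparameters, not values from the source:

```python
# Multi-task loss of the adaptive wake-up model: the RNN-T loss of the
# first branch and the CTC loss of the second branch are combined with
# weight coefficients (here constrained to sum to 1 for illustration).
def multitask_loss(loss_rnnt, loss_ctc, alpha=0.7):
    return alpha * loss_rnnt + (1.0 - alpha) * loss_ctc

LOSS_THRESHOLD = 0.5  # illustrative preset loss threshold

l_mt = multitask_loss(loss_rnnt=0.4, loss_ctc=0.9)
training_done = l_mt < LOSS_THRESHOLD  # training stops once below the threshold
print(round(l_mt, 3), training_done)   # 0.55 False
```

In practice the two loss results would come from RNN-T and CTC loss layers evaluated on each minibatch, and the check would run once per evaluation interval.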
  • Step S103 using the trained adaptive wake-up model to perform keyword detection on the data to be tested, so as to realize voice wake-up.
  • Using the trained adaptive wake-up model to perform keyword detection on the data to be tested may include: inputting the data to be tested into the trained adaptive wake-up model to obtain the first predicted probability value of the first branch and the second predicted probability value of the second branch; determining the probability weighted sum of the first predicted probability value and the second predicted probability value; and, when the probability weighted sum is greater than a preset probability threshold, taking the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
  • The adaptive wake-up model obtained after training is expected to be better optimized.
  • When making predictions, the first branch outputs the first predicted probability value and the second branch outputs the second predicted probability value.
  • The two predicted probability values are weighted according to the weight coefficient corresponding to each branch to obtain the probability weighted sum.
  • If the probability weighted sum is greater than the preset probability threshold, the symbol corresponding to the probability weighted sum is used as the keyword for voice wake-up.
  • Because keyword detection uses the probability weighted sum computed from the predicted probability values of both branches of the adaptive wake-up model, the detection result is more accurate and wake-up accuracy is improved.
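A toy version of this weighted decision, with made-up branch weights, threshold, and a three-symbol vocabulary:

```python
import numpy as np

def detect_keyword(p1, p2, w1=0.5, w2=0.5, threshold=0.6):
    """Weight the two branches' symbol distributions; wake up only when
    the best symbol's weighted probability exceeds the threshold."""
    weighted = w1 * np.asarray(p1) + w2 * np.asarray(p2)
    best = int(np.argmax(weighted))
    if weighted[best] > threshold:
        return best, float(weighted[best])  # this symbol fires the wake-up
    return None                             # no wake-up

p1 = [0.1, 0.8, 0.1]  # first branch (seed RNN-T) predicted probabilities
p2 = [0.2, 0.7, 0.1]  # second branch (encoder + FFNN) predicted probabilities
result = detect_keyword(p1, p2)
print(result)  # symbol 1 wakes the device
```

If both branches agree only weakly (e.g., no symbol's weighted probability clears the threshold), the function returns `None` and the device stays asleep.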
  • In the embodiment of the present disclosure, the adaptive wake-up model is trained with training samples that include both speech recognition data and wake-up word data, which avoids overfitting of the prediction network during model training.
  • Using the trained adaptive wake-up model for voice wake-up therefore improves wake-up accuracy.
  • FIG. 2 is a flowchart of the voice wake-up method provided by the second embodiment of the present disclosure.
  • the embodiment of the present disclosure may be combined with multiple optional solutions in the above-mentioned embodiments.
  • In this embodiment, after the trained adaptive wake-up model is used to perform keyword detection on the data to be tested to realize voice wake-up, the method further includes: detecting the voice wake-up result.
  • the method of the embodiment of the present disclosure includes the following steps.
  • Step S201 training an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data includes speech recognition data.
  • The wake-up model includes a Recurrent Neural Network Transducer (RNN-T) model.
  • the RNN-T model includes an encoder network, a prediction network, and a joint network.
  • the encoder network can be represented by the symbol Encoder
  • the prediction network can be represented by the symbol Prediction Network
  • the joint network can be represented by the symbol Joint Network
  • the joint network is connected with the encoder network and the prediction network, respectively.
  • the input of the encoder network is the acoustic feature
  • the input of the prediction network is the last predicted symbol (text information)
  • the output of the entire RNN-T model is the probability distribution of the current symbol.
  • Training the initial wake-up model based on the first training sample data to obtain the seed wake-up model may include: obtaining first initial sample data; expanding the first initial sample data to obtain the first training sample data; and training the initial RNN-T model based on the first training sample data to obtain the seed RNN-T model.
  • Step S202 Initialize the seed wake-up model to obtain an adaptive wake-up model, and train the adaptive wake-up model based on second training sample data, wherein the second training sample data includes speech recognition data and wake-up word data.
  • initializing the seed wake-up model to obtain an adaptive wake-up model may include: adding FFNN on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network; using the seed RNN-T model as the first branch , taking the FFNN and the encoder network as the second branch; the adaptive wake-up model is obtained according to the first branch and the second branch.
  • Training the adaptive wake-up model based on the second training sample data includes: obtaining second initial sample data; expanding the second initial sample data to obtain the second training sample data; and training the adaptive wake-up model based on the second training sample data.
  • Training the adaptive wake-up model based on the second training sample data includes: training the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result; training the second branch based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result; and determining the weighted sum of the first loss function result and the second loss function result. When the weighted sum of the loss functions is less than the preset loss threshold, training of the adaptive wake-up model is determined to be complete.
  • Step S203 using the trained adaptive wake-up model to perform keyword detection on the data to be tested, so as to realize voice wake-up.
  • Using the trained adaptive wake-up model to perform keyword detection on the data to be tested may include: inputting the data to be tested into the trained adaptive wake-up model to obtain the first predicted probability value of the first branch and the second predicted probability value of the second branch; determining the probability weighted sum of the first predicted probability value and the second predicted probability value; and, when the probability weighted sum is greater than the preset probability threshold, taking the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
  • Step S204 detecting the voice wake-up result.
  • Detecting the voice wake-up result means detecting whether the device can start the voice interaction process according to the keyword. For example, suppose the keyword is "ABAB": when the user utters the voice information corresponding to the keyword, it is determined whether the device can interact according to the keyword contained in that voice information, for example, whether the device gives the voice response "Do you have any instructions?". If the device can start the voice interaction process, the voice wake-up result is determined to be accurate; otherwise, the voice wake-up is determined to have failed.
  • The failure may be caused by a hardware fault of the device itself, or by inaccurate sample data used when training the adaptive wake-up model.
  • This embodiment does not limit the reason why the voice wake-up result failed.
  • When voice wake-up fails, an alarm prompt is given.
  • The alarm prompt can be in the form of voice or text; this embodiment does not limit the form of the alarm prompt.
  • The alarm prompt can remind the user to repair the equipment or adjust the voice wake-up process as soon as possible, so as to ensure the accuracy of voice wake-up.
  • In the embodiment of the present disclosure, the adaptive wake-up model is trained with training samples that include both speech recognition data and wake-up word data, which avoids overfitting of the prediction network during model training.
  • Using the trained adaptive wake-up model for voice wake-up therefore improves wake-up accuracy.
  • FIG. 3 is a schematic structural diagram of a voice wake-up device provided in Embodiment 3 of the present disclosure.
  • the apparatus can be implemented in software and/or hardware, and can be integrated in an electronic device that executes the voice wake-up method. As shown in Figure 3, the apparatus may include the following modules.
  • the seed wake-up model obtaining module 310 is configured to perform training on the initial wake-up model based on the first training sample data to obtain the seed wake-up model, wherein the first training sample data includes speech recognition data;
  • the adaptive wake-up model training module 320 is configured to initialize the seed wake-up model to obtain the adaptive wake-up model, and train the adaptive wake-up model based on the second training sample data, wherein the second training sample data includes speech recognition data and Wake word data.
  • the voice wake-up module 330 is configured to use the trained adaptive wake-up model to perform keyword detection on the data to be tested, so as to realize voice wake-up.
  • In the embodiment of the present disclosure, the adaptive wake-up model is trained with training samples that include both speech recognition data and wake-up word data, which avoids overfitting of the prediction network during model training.
  • Using the trained adaptive wake-up model for voice wake-up therefore improves wake-up accuracy.
  • The wake-up model includes a Recurrent Neural Network Transducer (RNN-T) model.
  • The RNN-T model includes an encoder network, a prediction network, and a joint network.
  • The joint network is connected with the encoder network and the prediction network, respectively.
  • The seed wake-up model obtaining module 310 is configured to obtain first initial sample data, expand the first initial sample data to obtain the first training sample data, and train the initial RNN-T model based on the first training sample data to obtain the seed RNN-T model.
  • The adaptive wake-up model training module 320 includes an adaptive wake-up model obtaining sub-module configured to add an FFNN on the basis of the seed RNN-T model, where the FFNN is connected to the encoder network;
  • the seed RNN-T model is used as the first branch, and the FFNN and encoder network are used as the second branch;
  • An adaptive wake-up model is obtained according to the first branch and the second branch.
  • The adaptive wake-up model training module 320 includes an adaptive wake-up model training sub-module configured to obtain second initial sample data, expand the second initial sample data to obtain the second training sample data, and train the adaptive wake-up model based on the second training sample data.
  • the adaptive wake-up model training sub-module is further configured to: perform training on the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result;
  • the second branch is trained based on the speech recognition data and the wake-up word data in the second training sample data to obtain a second loss function result
  • the weighted sum of the loss functions of the first loss function result and the second loss function result is determined, and when the weighted sum of the loss functions is less than the preset loss threshold, it is determined that the training of the adaptive wake-up model is completed.
  • The voice wake-up module 330 is configured to: input the data to be tested into the trained adaptive wake-up model to obtain the first predicted probability value of the first branch and the second predicted probability value of the second branch;
  • determine the probability weighted sum of the first predicted probability value and the second predicted probability value, and, when the probability weighted sum is greater than the preset probability threshold, use the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
  • The voice wake-up device provided by the embodiments of the present disclosure belongs to the same concept as the voice wake-up methods provided by the above embodiments; for technical details not described in detail here, reference may be made to the above embodiments, and the device has the same effects as the above embodiments.
  • the electronic device in the embodiment of the present disclosure may be a device corresponding to the back-end service platform of the application, or may be a mobile terminal device on which an application client is installed.
  • The electronic device may include, for example, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (Portable Android Device, PAD), a Portable Multimedia Player (PMP), and an in-vehicle terminal (e.g., an in-vehicle navigation terminal), as well as stationary terminals such as a digital television (TV), a desktop computer, and the like.
  • The electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403.
  • In the RAM 403, various programs and data required for the operation of the electronic device 400 are also stored.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An Input/Output (I/O) interface 405 is also connected to the bus 404.
  • The following devices may be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), speakers, a vibrator, etc.; a storage device 408 including, for example, magnetic tape, a hard disk, etc.; and a communication device 409.
  • Communication means 409 may allow electronic device 400 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 4 shows the electronic device 400 having various devices, it is not required to implement or have all of the illustrated devices; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 409, or from the storage device 408, or from the ROM 402.
  • the processing apparatus 401 When the computer program is executed by the processing apparatus 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium can be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Computer-readable storage media may include: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, an Erasable Programmable Read-Only Memory (EPROM), flash memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the storage medium may be a non-transitory storage medium.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code embodied on the computer-readable medium may be transmitted by any suitable medium, including: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
  • Clients and servers can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: train the initial wake-up model based on the first training sample data to obtain the seed wake-up model, where the first training sample data includes speech recognition data; initialize the seed wake-up model to obtain the adaptive wake-up model, and train the adaptive wake-up model based on the second training sample data, where the second training sample data includes speech recognition data and wake-up word data; and use the trained adaptive wake-up model to perform keyword detection on the data to be tested to realize voice wake-up.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in software or in hardware; the name of a unit does not in itself constitute a limitation of the unit.
  • exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Parts (ASSPs), Systems on Chip (SOCs), and Complex Programmable Logic Devices (CPLDs).
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • machine-readable media may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, RAM, ROM, EPROM, flash memory, optical fibers, portable CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • Example 1 provides a voice wake-up method, including:
  • the seed wake-up model is obtained by training the initial wake-up model based on the first training sample data, wherein the first training sample data includes speech recognition data;
  • the seed wake-up model is initialized to obtain an adaptive wake-up model, and the adaptive wake-up model is trained based on second training sample data, wherein the second training sample data includes speech recognition data and wake-up word data;
  • the trained adaptive wake-up model is used to perform keyword detection on the data to be tested to achieve voice wake-up.
  • Example 2 provides the method of Example 1, the wake-up model includes a recurrent neural network converter RNN-T model, the RNN-T model includes an encoder network, a prediction network and a joint network, and the joint network is connected with the encoder network and the prediction network, respectively.
  • Example 3 provides the method of Example 2, wherein training the initial wake-up model based on the first training sample data to obtain the seed wake-up model includes: acquiring first initial sample data; and augmenting the first initial sample data to obtain the first training sample data;
  • the initial RNN-T model is trained based on the first training sample data to obtain a seed RNN-T model.
  • Example 4 provides the method of Example 3, wherein initializing the seed wake-up model to obtain the adaptive wake-up model includes: adding a feed-forward neural network (FFNN) on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network;
  • the seed RNN-T model is used as the first branch, and the FFNN and the encoder network are used as the second branch;
  • the adaptive wake-up model is obtained from the first branch and the second branch.
  • Example 5 provides the method of Example 4, wherein training the adaptive wake-up model based on the second training sample data includes: acquiring second initial sample data; and augmenting the second initial sample data to obtain the second training sample data;
  • the adaptive wake-up model is trained based on the second training sample data.
  • Example 6 provides the method of Example 5, wherein the training of the adaptive wake-up model based on the second training sample data includes:
  • the first branch is trained based on the speech recognition data in the second training sample data to obtain a first loss function result;
  • the second branch is trained based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result;
  • a weighted sum of the first loss function result and the second loss function result is determined, and when the weighted loss sum is less than a preset loss threshold, it is determined that training of the adaptive wake-up model is complete.
  • Example 7 provides the method of Example 4, wherein using the trained adaptive wake-up model to perform keyword detection on the data to be tested to implement voice wake-up includes: inputting the data to be tested into the trained adaptive wake-up model to obtain a first predicted probability value from the first branch and a second predicted probability value from the second branch; and determining a probability weighted sum of the first predicted probability value and the second predicted probability value;
  • when the probability weighted sum is greater than a preset probability threshold, the symbol corresponding to the probability weighted sum is used as the keyword for voice wake-up.
  • Example 8 provides a voice wake-up device, including:
  • the seed wake-up model acquisition module is configured to train the initial wake-up model based on the first training sample data to obtain the seed wake-up model, wherein the first training sample data contains speech recognition data;
  • the adaptive wake-up model training module is configured to initialize the seed wake-up model to obtain an adaptive wake-up model, and to train the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data;
  • the voice wake-up module is configured to perform keyword detection on the data to be tested using the trained adaptive wake-up model, so as to realize voice wake-up.
  • Example 9 provides the apparatus of Example 8, the wake-up model includes a recurrent neural network converter RNN-T model, the RNN-T model includes an encoder network, a prediction network and a joint network, and the joint network is connected with the encoder network and the prediction network, respectively.
  • Example 10 provides the apparatus of Example 9, wherein the seed wake-up model acquisition module is configured to: acquire first initial sample data; and augment the first initial sample data to obtain the first training sample data;
  • the initial RNN-T model is trained based on the first training sample data to obtain a seed RNN-T model.
  • Example 11 provides the apparatus of Example 10, wherein the adaptive wake-up model training module includes an adaptive wake-up acquisition sub-module configured to add an FFNN on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network;
  • the seed RNN-T model is used as the first branch, and the FFNN and the encoder network are used as the second branch;
  • the adaptive wake-up model is obtained from the first branch and the second branch.
  • Example 12 provides the apparatus of Example 11, wherein the adaptive wake-up model training module includes an adaptive wake-up model training sub-module configured to: acquire second initial sample data; and augment the second initial sample data to obtain the second training sample data;
  • the adaptive wake-up model is trained based on the second training sample data.
  • Example 13 provides the apparatus of Example 12, wherein the adaptive wake-up model training sub-module is further configured to: train the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result;
  • the second branch is trained based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result;
  • a weighted sum of the first loss function result and the second loss function result is determined, and when the weighted loss sum is less than a preset loss threshold, it is determined that training of the adaptive wake-up model is complete.
  • Example 14 provides the apparatus of Example 11, wherein the voice wake-up module is configured to: input the data to be tested into the trained adaptive wake-up model to obtain a first predicted probability value from the first branch and a second predicted probability value from the second branch; and determine a probability weighted sum of the first predicted probability value and the second predicted probability value;
  • when the probability weighted sum is greater than a preset probability threshold, the symbol corresponding to the probability weighted sum is used as the keyword for voice wake-up.
  • Example 15 provides an electronic device, including:
  • one or more processors;
  • a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any one of Examples 1-7.
  • Example 16 provides a storage medium containing computer-executable instructions and storing a computer program, wherein, when the computer program is executed by a processor, the method described in any one of Examples 1-7 is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

A voice wake-up method and apparatus, an electronic device, and a storage medium. The voice wake-up method includes: training an initial wake-up model based on first training sample data to obtain a seed wake-up model (S101); initializing the seed wake-up model to obtain an adaptive wake-up model, and training the adaptive wake-up model based on second training sample data (S102), wherein the second training sample data contains speech recognition data and wake-up word data; and performing keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up (S103).

Description

Voice wake-up method and apparatus, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202011474857.9, filed with the China National Intellectual Property Administration on December 14, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of data processing, for example to a voice wake-up method and apparatus, an electronic device, and a storage medium.
Background
Voice wake-up is one application of keyword detection: the user starts the whole voice interaction flow with a wake-up word. A Recurrent Neural Network Transducer (RNN-T) model enables real-time detection, and the RNN-T model consists of three parts: an encoder network (Encoder), a prediction network, and a joint network.
For an RNN-T model applied to voice wake-up, the model is usually trained by directly mixing wake-up word training data with speech recognition training data, and the trained model is then used for voice wake-up. However, the prediction network in RNN-T plays a role similar to that of a language model in speech recognition, predicting the next character (word) from the previous one. Because the training data contains a large amount of wake-up word data whose corresponding text is always the same, the prediction network overfits, which causes many false wake-ups when the model is used for voice wake-up and degrades wake-up accuracy.
Summary
The present disclosure provides a voice wake-up method and apparatus, an electronic device, and a storage medium, so as to implement voice wake-up through keyword detection.
A voice wake-up method is provided, including:
training an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data;
initializing the seed wake-up model to obtain an adaptive wake-up model, and training the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data;
performing keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
A voice wake-up apparatus is also provided, including:
a seed wake-up model acquisition module, configured to train an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data;
an adaptive wake-up model training module, configured to initialize the seed wake-up model to obtain an adaptive wake-up model, and to train the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data;
a voice wake-up module, configured to perform keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
An electronic device is also provided, including:
one or more processors;
a storage device configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any embodiment of the present disclosure.
A computer-readable storage medium is also provided, storing a computer program which, when executed by a processor, implements the method of any embodiment of the present disclosure.
Brief Description of the Drawings
FIG. 1A is a flowchart of a voice wake-up method provided in Embodiment 1 of the present disclosure;
FIG. 1B is a schematic structural diagram of the adaptive wake-up model provided in Embodiment 1 of the present disclosure;
FIG. 2 is a flowchart of a voice wake-up method provided in Embodiment 2 of the present disclosure;
FIG. 3 is a schematic structural diagram of a voice wake-up apparatus provided in Embodiment 3 of the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device provided in Embodiment 4 of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here. The drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the protection scope of the present disclosure.
The steps recited in the method implementations of the present disclosure may be executed in different orders and/or in parallel. In addition, method implementations may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and its variants as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
Concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order or interdependence of the functions performed by these apparatuses, modules, or units.
The modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive, and should be understood as "one or more" unless the context indicates otherwise.
The names of messages or information exchanged between multiple apparatuses in the implementations of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.
Embodiment 1
FIG. 1A is a flowchart of the voice wake-up method provided by this embodiment of the present disclosure. This embodiment is applicable to end-to-end voice wake-up. The method may be executed by the voice wake-up apparatus provided by this embodiment of the present disclosure; the apparatus may be implemented in software and/or hardware and may be integrated in a computer device. The method of this embodiment of the present disclosure includes the following steps.
As shown in FIG. 1A, the method in this embodiment of the present disclosure may include the following steps.
Step S101: train an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data.
Optionally, the wake-up model includes a Recurrent Neural Network Transducer (RNN-T) model. The RNN-T model includes an encoder network, a prediction network, and a joint network, which may be denoted by the symbols Encoder, Prediction Network, and Joint Network respectively, and the joint network is connected to the encoder network and the prediction network. The input to the encoder network is acoustic features, the input to the prediction network is the previously predicted symbol (text information), and the output of the whole RNN-T model is the probability distribution of the current symbol.
Optionally, training the initial wake-up model based on the first training sample data to obtain the seed wake-up model may include: acquiring first initial sample data; augmenting the first initial sample data to obtain the first training sample data; and training an initial RNN-T model based on the first training sample data to obtain a seed RNN-T model.
In this implementation, first initial sample data is acquired, which includes a small amount of speech recognition data. Speech recognition data refers to speech of arbitrary content and the text corresponding to that speech. Techniques such as room impulse response, speed perturbation, and noise addition may be applied to the first initial sample data to increase data diversity, yielding richer first training sample data. Although augmented, the augmented first training sample data still takes the form of speech recognition data; only the types of data are more diverse.
The initial RNN-T model is trained based on the augmented first training sample data to obtain a seed RNN-T model, whose network parameters are more optimized than those of the initial RNN-T model.
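The augmentation techniques described above (room impulse response, speed perturbation, and added noise) can be sketched in code. This is a minimal illustrative sketch, not the patent's implementation: the function names, the SNR-based noise mixing, and the index-resampling speed change are all assumptions made for demonstration.

```python
import math

def add_noise(waveform, noise, snr_db):
    # Mix a noise track into a clean waveform at a target SNR (in dB).
    sig_pow = sum(x * x for x in waveform) / len(waveform)
    noise_pow = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(waveform, noise)]

def speed_perturb(waveform, factor):
    # Crude speed change by index resampling; factor > 1 shortens the clip.
    out_len = int(len(waveform) / factor)
    return [waveform[min(int(i * factor), len(waveform) - 1)]
            for i in range(out_len)]

# One utterance can be expanded into several augmented variants:
clean = [math.sin(0.1 * t) for t in range(160)]
noise = [0.5] * 160
augmented = [
    add_noise(clean, noise, snr_db=10.0),
    speed_perturb(clean, 0.9),
    speed_perturb(clean, 1.1),
]
```

In practice a toolkit resampler and recorded room impulse responses would replace these toy functions; the point is only that each clean utterance yields several training variants while its transcript stays unchanged.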
Step S102: initialize the seed wake-up model to obtain an adaptive wake-up model, and train the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data.
Optionally, initializing the seed wake-up model to obtain the adaptive wake-up model may include: adding a feed-forward neural network (FFNN) on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network; taking the seed RNN-T model as a first branch and the FFNN together with the encoder network as a second branch; and obtaining the adaptive wake-up model from the first branch and the second branch.
FIG. 1B is a schematic structural diagram of the adaptive wake-up model in this implementation: an FFNN is added on the basis of the seed RNN-T model, with the seed RNN-T model on the left of the figure as the first branch and the FFNN plus encoder on the right as the second branch, so the two branches share a common structural part, namely the encoder network.
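The data flow of the two-branch structure in FIG. 1B can be sketched as follows, with the encoder shared between the full RNN-T branch and the FFNN branch. All four networks here are stand-ins (simple arithmetic on lists, not trained neural layers), chosen only to make the shared-encoder wiring explicit; the real model's layer types and dimensions are not specified by this sketch.

```python
def encoder(features):
    # Shared acoustic encoder: used by BOTH branches.
    return [0.5 * f for f in features]

def prediction_net(prev_symbol):
    # Branch 1 only: consumes the previously predicted symbol (text side).
    return float(prev_symbol)

def joint_net(enc, pred):
    # Branch 1 head: combines acoustic and text representations (RNN-T).
    return [e + pred for e in enc]

def ffnn(enc):
    # Branch 2 head: feed-forward network on top of the shared encoder.
    return [2.0 * e for e in enc]

def adaptive_wake_model(features, prev_symbol):
    enc = encoder(features)  # computed once, shared by both branches
    branch1 = joint_net(enc, prediction_net(prev_symbol))
    branch2 = ffnn(enc)
    return branch1, branch2
```

Note that the wake-word path (branch 2) never touches `prediction_net`, which is the structural property the description relies on to avoid overfitting the prediction network.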
Optionally, training the adaptive wake-up model based on the second training sample data includes: acquiring second initial sample data; augmenting the second initial sample data to obtain the second training sample data; and training the adaptive wake-up model based on the second training sample data.
When training the adaptive wake-up model, this implementation acquires second initial sample data, which includes a small amount of speech recognition data and wake-up word data. Although the first initial sample data also includes speech recognition data, only the data form is the same; the content of the speech recognition data differs. Wake-up word data refers to the speech and text corresponding to a keyword. The second initial sample data is augmented in much the same way as the first: room impulse response, speed perturbation, noise addition, and the like are applied to increase data diversity, yielding richer second training sample data. Although augmented, the augmented second training sample data still takes the form of speech recognition data and wake-up word data; only the types of data are more diverse.
Optionally, training the adaptive wake-up model based on the second training sample data includes: training the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result; training the second branch based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result; and determining a weighted sum of the first loss function result and the second loss function result, and determining that training of the adaptive wake-up model is complete when the weighted loss sum is less than a preset loss threshold.
In this implementation, the first branch and the second branch of the adaptive wake-up model correspond to different loss functions, and when training the adaptive wake-up model on the second training sample data the two branches use different data: the first branch uses speech recognition data, while the second branch uses speech recognition data and wake-up word data. Because the wake-up word data does not pass through the prediction network during training, the overfitting of the prediction network caused by the wake-up words all having the same text can be avoided. During training of the adaptive wake-up model, the following formula (1) may be used as the loss function of the whole model:
L_MT = αL_RNN-T + βL_CTC        (1)
where L_MT denotes the overall loss function of the adaptive wake-up model, L_RNN-T denotes the loss function of the first branch, L_CTC denotes the loss function of the second branch, α denotes the weight coefficient of the first branch, and β denotes the weight coefficient of the second branch.
During training, the first branch is trained on speech recognition data to obtain the first loss function result, and the second branch is trained on speech recognition data and wake-up word data to obtain the second loss function result. With weight coefficient α for the first branch and β for the second, the weighted loss sum of the whole model is obtained; this weighted sum is compared with the preset loss threshold, and if it is below the threshold, training of the adaptive wake-up model is determined to be complete.
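Formula (1) and the threshold-based stopping rule translate directly into code. The concrete weights (α = β = 0.5) and the threshold value below are placeholders; the patent does not fix any numerical values for them.

```python
def multitask_loss(l_rnnt, l_ctc, alpha=0.5, beta=0.5):
    # Formula (1): L_MT = alpha * L_RNN-T + beta * L_CTC
    return alpha * l_rnnt + beta * l_ctc

def training_complete(l_mt, loss_threshold):
    # Training is considered done once the weighted loss sum
    # falls below the preset loss threshold.
    return l_mt < loss_threshold

# Placeholder per-branch losses from one evaluation pass:
l_mt = multitask_loss(l_rnnt=1.0, l_ctc=3.0)
```

With these placeholder values `l_mt` is 2.0, so `training_complete(l_mt, 2.5)` would flag training as finished while `training_complete(l_mt, 1.5)` would not.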
Step S103: perform keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
Optionally, performing keyword detection on the data under test with the trained adaptive wake-up model to implement voice wake-up may include: inputting the data under test into the trained adaptive wake-up model to obtain a first predicted probability value from the first branch and a second predicted probability value from the second branch; determining a probability weighted sum of the first predicted probability value and the second predicted probability value; and, when the probability weighted sum is greater than a preset probability threshold, taking the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
Since the wake-up word data does not pass through the prediction network during training, the overfitting of the prediction network caused by the wake-up words all having the same text can be avoided, so the adaptive wake-up model obtained through training is more complete. When using the trained adaptive wake-up model to perform keyword detection on data under test to implement voice wake-up, because the model contains two branches, the first branch outputs a first predicted probability value and the second branch outputs a second predicted probability value at prediction time. The two predicted probability values are weighted by the weight coefficients of the corresponding branches to obtain a probability weighted sum; when this sum is greater than the preset probability threshold, the symbol corresponding to it is taken as the keyword for voice wake-up.
In this implementation, when the trained adaptive wake-up model performs keyword detection, the detection is based on the probability weighted sum obtained by jointly computing the predicted probability values from the two branches of the model, which makes the detection result more accurate and improves wake-up accuracy.
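The detection step described above can be sketched as a weighted combination of per-symbol probabilities from the two branches. Representing each branch's output as a mapping from symbol to probability, and the specific weights and threshold, are assumptions made for illustration only.

```python
def detect_keyword(p_branch1, p_branch2, alpha=0.5, beta=0.5, threshold=0.6):
    # p_branch1 / p_branch2: symbol -> predicted probability from each branch.
    best_symbol, best_score = None, float("-inf")
    for symbol in p_branch1:
        score = alpha * p_branch1[symbol] + beta * p_branch2.get(symbol, 0.0)
        if score > best_score:
            best_symbol, best_score = symbol, score
    # Wake up only when the probability weighted sum clears the threshold.
    return best_symbol if best_score > threshold else None
```

For example, if both branches assign high probability to a wake-up word such as "ABAB", the weighted sum exceeds the threshold and the word is returned as the wake-up keyword; otherwise the function returns `None` and no wake-up occurs.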
In this embodiment of the present disclosure, a seed wake-up model is obtained, an adaptive wake-up model is obtained on the basis of the seed wake-up model, and the adaptive wake-up model is trained with training samples containing speech recognition data and wake-up word data, thereby avoiding overfitting during model training; using the trained adaptive wake-up model for voice wake-up improves wake-up accuracy.
Embodiment 2
FIG. 2 is a flowchart of the voice wake-up method provided in Embodiment 2 of the present disclosure. This embodiment may be combined with the optional solutions in the above embodiment. In this embodiment, after performing keyword detection on the data under test with the trained adaptive wake-up model to implement voice wake-up, the method further includes: checking the voice wake-up result.
As shown in FIG. 2, the method of this embodiment of the present disclosure includes the following steps.
Step S201: train an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data.
Optionally, the wake-up model includes a Recurrent Neural Network Transducer (RNN-T) model. The RNN-T model includes an encoder network, a prediction network, and a joint network, which may be denoted by the symbols Encoder, Prediction Network, and Joint Network respectively, and the joint network is connected to the encoder network and the prediction network. The input to the encoder network is acoustic features, the input to the prediction network is the previously predicted symbol (text information), and the output of the whole RNN-T model is the probability distribution of the current symbol.
Optionally, training the initial wake-up model based on the first training sample data to obtain the seed wake-up model may include: acquiring first initial sample data; augmenting the first initial sample data to obtain the first training sample data; and training an initial RNN-T model based on the first training sample data to obtain a seed RNN-T model.
Step S202: initialize the seed wake-up model to obtain an adaptive wake-up model, and train the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data.
Optionally, initializing the seed wake-up model to obtain the adaptive wake-up model may include: adding an FFNN on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network; taking the seed RNN-T model as a first branch and the FFNN together with the encoder network as a second branch; and obtaining the adaptive wake-up model from the first branch and the second branch.
Optionally, training the adaptive wake-up model based on the second training sample data includes: acquiring second initial sample data; augmenting the second initial sample data to obtain the second training sample data; and training the adaptive wake-up model based on the second training sample data.
Optionally, training the adaptive wake-up model based on the second training sample data includes: training the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result; training the second branch based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result; and determining a weighted sum of the first loss function result and the second loss function result, and determining that training of the adaptive wake-up model is complete when the weighted loss sum is less than a preset loss threshold.
Step S203: perform keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
Optionally, performing keyword detection on the data under test with the trained adaptive wake-up model to implement voice wake-up may include: inputting the data under test into the trained adaptive wake-up model to obtain a first predicted probability value from the first branch and a second predicted probability value from the second branch; determining a probability weighted sum of the first predicted probability value and the second predicted probability value; and, when the probability weighted sum is greater than a preset probability threshold, taking the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
Step S204: check the voice wake-up result.
In this implementation, after performing keyword detection on the data under test with the trained adaptive wake-up model to implement voice wake-up, the voice wake-up result also needs to be checked, i.e., whether the device can start the voice interaction flow according to the keyword. For example, suppose the keyword is determined to be "ABAB": when it is determined that the user has uttered the speech corresponding to the keyword, it is judged whether the device can interact according to the keyword contained in the speech information, e.g., whether the device can give a spoken response such as "What can I do for you?". If the device can start the voice interaction flow, the voice wake-up result is determined to be accurate; otherwise the voice wake-up result is determined to have failed.
When the voice wake-up result is determined to have failed, the failure may be caused by a hardware fault of the device itself, or by inaccurate sample data during training of the adaptive wake-up model; this implementation does not limit the cause of the failure. When the voice wake-up result fails, an alarm prompt is issued, which may take the form of speech or text; this implementation does not limit the form of the alarm prompt. The alarm prompt can remind the user to repair the device or adjust the voice wake-up process as soon as possible, thereby ensuring the accuracy of voice wake-up.
In this embodiment of the present disclosure, a seed wake-up model is obtained, an adaptive wake-up model is obtained on its basis, and the adaptive wake-up model is trained with training samples containing speech recognition data and wake-up word data, thereby avoiding overfitting during model training; using the trained adaptive wake-up model for voice wake-up improves wake-up accuracy. By checking the voice wake-up result and issuing an alarm prompt when voice wake-up is determined to have failed, the user can be reminded to repair the device or adjust the voice wake-up process in time, ensuring the accuracy of voice wake-up.
Embodiment 3
FIG. 3 is a schematic structural diagram of the voice wake-up apparatus provided in Embodiment 3 of the present disclosure. The apparatus may be implemented in software and/or hardware and may be integrated in an electronic device that executes the voice wake-up method. As shown in FIG. 3, the apparatus may include the following modules.
A seed wake-up model acquisition module 310, configured to train an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data;
an adaptive wake-up model training module 320, configured to initialize the seed wake-up model to obtain an adaptive wake-up model, and to train the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data;
a voice wake-up module 330, configured to perform keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
In this embodiment of the present disclosure, a seed wake-up model is obtained, an adaptive wake-up model is obtained on the basis of the seed wake-up model, and the adaptive wake-up model is trained with training samples containing speech recognition data and wake-up word data, thereby avoiding overfitting during model training; using the trained adaptive wake-up model for voice wake-up improves wake-up accuracy.
Optionally, on the basis of the above technical solutions, the wake-up model includes a Recurrent Neural Network Transducer (RNN-T) model; the RNN-T model includes an encoder network, a prediction network, and a joint network, and the joint network is connected to the encoder network and the prediction network respectively.
Optionally, on the basis of the above technical solutions, the seed wake-up model acquisition module 310 is configured to: acquire first initial sample data;
augment the first initial sample data to obtain the first training sample data;
train the initial RNN-T model based on the first training sample data to obtain a seed RNN-T model.
Optionally, on the basis of the above technical solutions, the adaptive wake-up model training module 320 includes an adaptive wake-up acquisition sub-module configured to add an FFNN on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network;
take the seed RNN-T model as a first branch, and take the FFNN and the encoder network as a second branch;
obtain the adaptive wake-up model from the first branch and the second branch.
Optionally, on the basis of the above technical solutions, the adaptive wake-up model training module 320 includes an adaptive wake-up model training sub-module configured to acquire second initial sample data;
augment the second initial sample data to obtain the second training sample data;
train the adaptive wake-up model based on the second training sample data.
Optionally, on the basis of the above technical solutions, the adaptive wake-up model training sub-module is further configured to: train the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result;
train the second branch based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result;
determine a weighted sum of the first loss function result and the second loss function result, and determine that training of the adaptive wake-up model is complete when the weighted loss sum is less than a preset loss threshold.
Optionally, on the basis of the above technical solutions, the voice wake-up module 330 is configured to: input the data under test into the trained adaptive wake-up model to obtain a first predicted probability value from the first branch and a second predicted probability value from the second branch;
determine a probability weighted sum of the first predicted probability value and the second predicted probability value;
when the probability weighted sum is greater than a preset probability threshold, take the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
The voice wake-up apparatus provided in this embodiment of the present disclosure belongs to the same concept as the voice wake-up method provided in the above embodiments; technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same effects as the above embodiments.
Embodiment 4
Referring now to FIG. 4, a schematic structural diagram of an electronic device 400 suitable for implementing embodiments of the present disclosure is shown. The electronic device in embodiments of the present disclosure may be a device corresponding to the back-end service platform of an application, or a mobile terminal device with an application client installed. The electronic device may include mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Device, PAD), portable multimedia players (PMPs), and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), as well as fixed terminals such as digital televisions (TVs) and desktop computers. The electronic device shown in FIG. 4 is merely an example and should not impose any limitation on the functions and scope of use of embodiments of the present disclosure.
As shown in FIG. 4, the electronic device 400 may include a processing apparatus (e.g., a central processing unit, a graphics processing unit, etc.) 401, which may perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage apparatus 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400. The processing apparatus 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following apparatuses may be connected to the I/O interface 405: input apparatuses 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output apparatuses 407 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage apparatuses 408 including, for example, magnetic tape and hard disk; and a communication apparatus 409. The communication apparatus 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 4 shows the electronic device 400 with multiple apparatuses, it is not required to implement or have all of the apparatuses shown; more or fewer apparatuses may alternatively be implemented or provided.
According to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication apparatus 409, or installed from the storage apparatus 408, or installed from the ROM 402. When the computer program is executed by the processing apparatus 401, the above-described functions defined in the method of the embodiments of the present disclosure are executed.
The computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. The computer-readable storage medium may include: an electrical connection having one or more wires, a portable computer disk, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. The storage medium may be a non-transitory storage medium. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including electric wire, optical cable, radio frequency (RF), etc., or any suitable combination of the above.
In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internetwork (e.g., the Internet), a peer-to-peer network (e.g., an ad hoc peer-to-peer network), and any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs; when executed by the electronic device, the one or more programs cause the electronic device to: train an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data; initialize the seed wake-up model to obtain an adaptive wake-up model, and train the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data; and perform keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not in itself constitute a limitation of the unit in one case.
The functions described above herein may be executed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. Machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM, flash memory, optical fiber, portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, [Example 1] provides a voice wake-up method, including:
training an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data;
initializing the seed wake-up model to obtain an adaptive wake-up model, and training the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data;
performing keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
According to one or more embodiments of the present disclosure, [Example 2] provides the method of Example 1, wherein the wake-up model includes a Recurrent Neural Network Transducer (RNN-T) model, the RNN-T model includes an encoder network, a prediction network, and a joint network, and the joint network is connected to the encoder network and the prediction network respectively.
According to one or more embodiments of the present disclosure, [Example 3] provides the method of Example 2, wherein training the initial wake-up model based on the first training sample data to obtain the seed wake-up model includes:
acquiring first initial sample data;
augmenting the first initial sample data to obtain the first training sample data;
training the initial RNN-T model based on the first training sample data to obtain a seed RNN-T model.
According to one or more embodiments of the present disclosure, [Example 4] provides the method of Example 3, wherein initializing the seed wake-up model to obtain the adaptive wake-up model includes:
adding an FFNN on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network;
taking the seed RNN-T model as a first branch, and taking the FFNN and the encoder network as a second branch;
obtaining the adaptive wake-up model from the first branch and the second branch.
According to one or more embodiments of the present disclosure, [Example 5] provides the method of Example 4, wherein training the adaptive wake-up model based on the second training sample data includes:
acquiring second initial sample data;
augmenting the second initial sample data to obtain the second training sample data;
training the adaptive wake-up model based on the second training sample data.
According to one or more embodiments of the present disclosure, [Example 6] provides the method of Example 5, wherein training the adaptive wake-up model based on the second training sample data includes:
training the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result;
training the second branch based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result;
determining a weighted sum of the first loss function result and the second loss function result, and determining that training of the adaptive wake-up model is complete when the weighted loss sum is less than a preset loss threshold.
According to one or more embodiments of the present disclosure, [Example 7] provides the method of Example 4, wherein performing keyword detection on the data under test using the trained adaptive wake-up model to implement voice wake-up includes:
inputting the data under test into the trained adaptive wake-up model to obtain a first predicted probability value from the first branch and a second predicted probability value from the second branch;
determining a probability weighted sum of the first predicted probability value and the second predicted probability value;
when the probability weighted sum is greater than a preset probability threshold, taking the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
According to one or more embodiments of the present disclosure, [Example 8] provides a voice wake-up apparatus, including:
a seed wake-up model acquisition module, configured to train an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data; an adaptive wake-up model training module, configured to initialize the seed wake-up model to obtain an adaptive wake-up model, and to train the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data;
a voice wake-up module, configured to perform keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
According to one or more embodiments of the present disclosure, [Example 9] provides the apparatus of Example 8, wherein the wake-up model includes a Recurrent Neural Network Transducer (RNN-T) model, the RNN-T model includes an encoder network, a prediction network, and a joint network, and the joint network is connected to the encoder network and the prediction network respectively.
According to one or more embodiments of the present disclosure, [Example 10] provides the apparatus of Example 9, wherein the seed wake-up model acquisition module is configured to: acquire first initial sample data;
augment the first initial sample data to obtain the first training sample data;
train the initial RNN-T model based on the first training sample data to obtain a seed RNN-T model.
According to one or more embodiments of the present disclosure, [Example 11] provides the apparatus of Example 10, wherein the adaptive wake-up model training module includes an adaptive wake-up acquisition sub-module configured to add an FFNN on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network;
take the seed RNN-T model as a first branch, and take the FFNN and the encoder network as a second branch;
obtain the adaptive wake-up model from the first branch and the second branch.
According to one or more embodiments of the present disclosure, [Example 12] provides the apparatus of Example 11, wherein the adaptive wake-up model training module includes an adaptive wake-up model training sub-module configured to acquire second initial sample data;
augment the second initial sample data to obtain the second training sample data;
train the adaptive wake-up model based on the second training sample data.
According to one or more embodiments of the present disclosure, [Example 13] provides the apparatus of Example 12, wherein the adaptive wake-up model training sub-module is further configured to: train the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result;
train the second branch based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result;
determine a weighted sum of the first loss function result and the second loss function result, and determine that training of the adaptive wake-up model is complete when the weighted loss sum is less than a preset loss threshold.
According to one or more embodiments of the present disclosure, [Example 14] provides the apparatus of Example 11, wherein the voice wake-up module is configured to: input the data under test into the trained adaptive wake-up model to obtain a first predicted probability value from the first branch and a second predicted probability value from the second branch;
determine a probability weighted sum of the first predicted probability value and the second predicted probability value;
when the probability weighted sum is greater than a preset probability threshold, take the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
According to one or more embodiments of the present disclosure, [Example 15] provides an electronic device, including:
one or more processors;
a storage device configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any one of Examples 1-7.
According to one or more embodiments of the present disclosure, [Example 16] provides a storage medium containing computer-executable instructions and storing a computer program, wherein, when the computer program is executed by a processor, the method described in any one of Examples 1-7 is implemented.
In addition, although multiple operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several implementation details are included in the discussion above, these should not be construed as limiting the scope of the present disclosure. Some features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.

Claims (10)

  1. A voice wake-up method, comprising:
    training an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data;
    initializing the seed wake-up model to obtain an adaptive wake-up model, and training the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data;
    performing keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
  2. The method according to claim 1, wherein the wake-up model comprises a Recurrent Neural Network Transducer (RNN-T) model, the RNN-T model comprises an encoder network, a prediction network, and a joint network, and the joint network is connected to the encoder network and the prediction network respectively.
  3. The method according to claim 2, wherein training the initial wake-up model based on the first training sample data to obtain the seed wake-up model comprises:
    acquiring first initial sample data;
    augmenting the first initial sample data to obtain the first training sample data;
    training an initial RNN-T model based on the first training sample data to obtain a seed RNN-T model.
  4. The method according to claim 3, wherein initializing the seed wake-up model to obtain the adaptive wake-up model comprises:
    adding a feed-forward neural network (FFNN) on the basis of the seed RNN-T model, wherein the FFNN is connected to the encoder network;
    taking the seed RNN-T model as a first branch, and taking the FFNN and the encoder network as a second branch;
    obtaining the adaptive wake-up model from the first branch and the second branch.
  5. The method according to claim 4, wherein training the adaptive wake-up model based on the second training sample data comprises:
    acquiring second initial sample data;
    augmenting the second initial sample data to obtain the second training sample data;
    training the adaptive wake-up model based on the second training sample data.
  6. The method according to claim 5, wherein training the adaptive wake-up model based on the second training sample data comprises:
    training the first branch based on the speech recognition data in the second training sample data to obtain a first loss function result;
    training the second branch based on the speech recognition data and wake-up word data in the second training sample data to obtain a second loss function result;
    determining a weighted sum of the first loss function result and the second loss function result, and determining that training of the adaptive wake-up model is complete in a case where the weighted loss sum is less than a preset loss threshold.
  7. The method according to claim 4, wherein performing keyword detection on the data under test using the trained adaptive wake-up model to implement voice wake-up comprises:
    inputting the data under test into the trained adaptive wake-up model to obtain a first predicted probability value from the first branch and a second predicted probability value from the second branch;
    determining a probability weighted sum of the first predicted probability value and the second predicted probability value;
    in a case where the probability weighted sum is greater than a preset probability threshold, taking the symbol corresponding to the probability weighted sum as the keyword for voice wake-up.
  8. A voice wake-up apparatus, comprising:
    a seed wake-up model acquisition module, configured to train an initial wake-up model based on first training sample data to obtain a seed wake-up model, wherein the first training sample data contains speech recognition data;
    an adaptive wake-up model training module, configured to initialize the seed wake-up model to obtain an adaptive wake-up model, and to train the adaptive wake-up model based on second training sample data, wherein the second training sample data contains speech recognition data and wake-up word data;
    a voice wake-up module, configured to perform keyword detection on data under test using the trained adaptive wake-up model, so as to implement voice wake-up.
  9. An electronic device, comprising:
    at least one processor;
    a storage device configured to store at least one program,
    wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the voice wake-up method according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the voice wake-up method according to any one of claims 1-7.
PCT/CN2021/135387 2020-12-14 2021-12-03 Voice wake-up method and apparatus, electronic device, and storage medium WO2022127620A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011474857.9 2020-12-14
CN202011474857.9A CN112712801B (zh) 2020-12-14 2024-02-02 Voice wake-up method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022127620A1 true WO2022127620A1 (zh) 2022-06-23

Family

ID=75542087

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/135387 WO2022127620A1 (zh) 2020-12-14 2021-12-03 Voice wake-up method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112712801B (zh)
WO (1) WO2022127620A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115064160A (zh) * 2022-08-16 2022-09-16 Alibaba (China) Co., Ltd. Voice wake-up method and apparatus
CN117079653A (zh) * 2023-10-11 2023-11-17 Honor Device Co., Ltd. Speech recognition method, training method for a speech recognition model, device, and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712801B (zh) * 2020-12-14 2024-02-02 Beijing Youzhuju Network Technology Co., Ltd. Voice wake-up method and apparatus, electronic device, and storage medium
CN113593546B (zh) * 2021-06-25 2023-09-15 Qingdao Haier Technology Co., Ltd. Terminal device wake-up method and apparatus, storage medium, and electronic apparatus

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009027980A1 (en) * 2007-08-28 2009-03-05 Yissum Research Development Company Of The Hebrew University Of Jerusalem Method, device and system for speech recognition
CN107123417A (zh) * 2017-05-16 2017-09-01 Shanghai Jiao Tong University Customized voice wake-up optimization method and system based on discriminative training
CN110491373A (zh) * 2019-08-19 2019-11-22 Oppo Guangdong Mobile Communications Co., Ltd. Model training method and apparatus, storage medium, and electronic device
CN111312222A (zh) * 2020-02-13 2020-06-19 Beijing SoundAI Technology Co., Ltd. Wake-up and speech recognition model training method and apparatus
CN111508481A (zh) * 2020-04-24 2020-08-07 Spreadtrum Communications (Shanghai) Co., Ltd. Training method and apparatus for a voice wake-up model, electronic device, and storage medium
CN111640426A (zh) * 2020-06-10 2020-09-08 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for outputting information
CN111667818A (zh) * 2020-05-27 2020-09-15 Beijing SoundAI Technology Co., Ltd. Method and apparatus for training a wake-up model
CN112712801A (zh) * 2020-12-14 2021-04-27 Beijing Youzhuju Network Technology Co., Ltd. Voice wake-up method and apparatus, electronic device, and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
WO2018086033A1 (en) * 2016-11-10 2018-05-17 Nuance Communications, Inc. Techniques for language independent wake-up word detection
US11100923B2 (en) * 2018-09-28 2021-08-24 Sonos, Inc. Systems and methods for selective wake word detection using neural network models



Also Published As

Publication number Publication date
CN112712801A (zh) 2021-04-27
CN112712801B (zh) 2024-02-02

Similar Documents

Publication Publication Date Title
WO2022127620A1 (zh) Voice wake-up method and apparatus, electronic device, and storage medium
CN111008533B (zh) Method, apparatus, device, and storage medium for acquiring a translation model
US11270690B2 Method and apparatus for waking up device
US20230306954A1 Speech synthesis method, apparatus, readable medium and electronic device
JP7046134B2 (ja) Method and apparatus for outputting information
CN111597825B (zh) Speech translation method and apparatus, readable medium, and electronic device
CN111046677B (zh) Method, apparatus, device, and storage medium for acquiring a translation model
WO2022228041A1 (zh) Training method and apparatus for a translation model, device, and storage medium
US11783808B2 Audio content recognition method and apparatus, and device and computer-readable medium
CN110516159B (zh) Information recommendation method and apparatus, electronic device, and storage medium
WO2022116821A1 (zh) Translation method, apparatus, device, and medium based on a multilingual machine translation model
CN111968647B (zh) Speech recognition method and apparatus, medium, and electronic device
WO2022228221A1 (zh) Information translation method, apparatus, device, and storage medium
WO2023005729A1 (zh) Speech information processing method and apparatus, and electronic device
CN116863935B (zh) Speech recognition method and apparatus, electronic device, and computer-readable medium
WO2022116819A1 (zh) Model training method and apparatus, machine translation method and apparatus, device, and storage medium
CN112309384B (zh) Speech recognition method and apparatus, electronic device, and medium
CN112562633A (zh) Singing synthesis method and apparatus, electronic device, and storage medium
CN113051933B (zh) Model training method, text semantic similarity determination method, apparatus, and device
CN113299285A (zh) Device control method and apparatus, electronic device, and computer-readable storage medium
WO2023011397A1 (zh) Acoustic feature generation, speech model training, and speech recognition methods and apparatus
CN116072108A (zh) Model generation method, speech recognition method, apparatus, medium, and device
CN113488050B (zh) Voice wake-up method and apparatus, storage medium, and electronic device
WO2022121859A1 (zh) Spoken language information processing method and apparatus, and electronic device
WO2022134968A1 (zh) Model training method, speech recognition method, apparatus, medium, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21905549

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21905549

Country of ref document: EP

Kind code of ref document: A1