CN115862604B - Voice awakening model training and voice awakening method and device and computer equipment - Google Patents


Info

Publication number
CN115862604B
Authority
CN
China
Prior art keywords
wake
voice
sample data
model
classification
Prior art date
Legal status
Active
Application number
CN202211481741.7A
Other languages
Chinese (zh)
Other versions
CN115862604A (en)
Inventor
李蒙
Current Assignee
Mgjia Beijing Technology Co ltd
Original Assignee
Mgjia Beijing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Mgjia Beijing Technology Co ltd filed Critical Mgjia Beijing Technology Co ltd
Priority to CN202211481741.7A
Publication of CN115862604A
Application granted
Publication of CN115862604B

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a voice wake-up model training method and device, a voice wake-up method and device, and computer equipment. The training method comprises the following steps: acquiring voice sample data, wherein the voice sample data carries labels related to wake-up words; and synchronously performing learning training for voice wake-up classification, sequence alignment labeling, and phoneme recognition on a wake-up model composed of an encoder and a decoder based on the voice sample data, to obtain a target voice wake-up model. Through multi-task learning, the wake-up model synchronously learns different kinds of information useful for wake-up, so that the kinds of information complement one another. This complementarity improves the recognition accuracy of the voice wake-up model and avoids false wake-ups to the greatest extent, allowing the model to balance false wake-up against false rejection: false wake-ups are minimized while the wake-up rate is maintained, improving the user experience.

Description

Voice awakening model training and voice awakening method and device and computer equipment
Technical Field
The invention relates to the field of computer technology, and in particular to a voice wake-up model training method, a voice wake-up method, corresponding devices, and computer equipment.
Background
Before voice interaction, a device must first be woken up, so that it switches from a dormant state to a working state and can process user instructions normally. Switching the device from the dormant state to the working state is called wake-up. Common wake-up methods include touch wake-up (a lock-screen key), timed wake-up (an alarm clock), and passive wake-up (an incoming call); voice wake-up means switching the device from the dormant state to the working state by voice.
In the related art, the main difficulty of voice wake-up technology is balancing false wake-up against false rejection, that is, improving the wake-up effect while keeping the false wake-up rate low. To achieve a low false wake-up rate, current mainstream schemes usually sacrifice the wake-up rate and thereby increase false rejections, and a low wake-up rate harms the user experience.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the above defects in the prior art, and to provide a voice wake-up model training method, a voice wake-up method, corresponding devices, and computer equipment.
According to a first aspect, the present invention provides a method for training a voice wake model, the method comprising:
acquiring voice sample data, wherein the voice sample data is provided with labels related to wake-up words;
and synchronously performing voice wake-up classification, sequence alignment labeling and phoneme recognition on a wake-up model formed by the encoder and the decoder based on the voice sample data to obtain a target voice wake-up model.
In this way, multi-task learning enables the wake-up model to synchronously learn different kinds of information useful for wake-up, so that the kinds of information complement one another. This complementarity improves the recognition accuracy of the voice wake-up model and avoids false wake-ups to the greatest extent, allowing the model to balance false wake-up against false rejection: false wake-ups are minimized while the wake-up rate is maintained, improving the user experience.
With reference to the first aspect, in a first embodiment of the first aspect, the learning training for performing voice wake-up classification on a wake-up model formed by the encoder and the decoder based on the voice sample data includes:
inputting the voice sample data into the encoder to obtain output characteristics;
after averaging the output features in the time-sequence dimension, inputting the averaged output features into a one-layer fully connected network to perform a first classification, and performing supervised training using the labels related to wake-up words in the voice sample data, wherein the categories of the first classification comprise: each wake-up word, plus one category representing non-wake-up words.
With reference to the first aspect, in a second embodiment of the first aspect, the training for performing sequence alignment labeling on a wake-up model formed by an encoder and a decoder based on the voice sample data includes:
inputting the voice sample data into the encoder to obtain output characteristics;
inputting the output characteristics into a layer of fully-connected network to carry out second classification, and carrying out supervised training by utilizing labels related to wake-up words in the voice sample data, wherein the categories of the second classification comprise: phonemes, silence classes, other pronunciation classes contained in the wake-up word.
With reference to the first aspect, in a third embodiment of the first aspect, the learning training for performing phoneme recognition on a wake model formed by an encoder and a decoder based on the speech sample data includes:
inputting the voice sample data into the encoder to obtain output characteristics;
inputting the output characteristics into the decoder for acoustic unit integration and third classification, and performing supervised training by using labels related to wake-up words in the voice sample data, wherein the third classification comprises: phonemes, silence classes, other pronunciation classes contained in the wake-up word.
With reference to the first aspect, in a fourth embodiment of the first aspect, the voice sample data includes: original voice sample data and enhanced voice sample data, wherein the enhanced voice sample data is voice sample data obtained by performing voice enhancement processing on the original voice sample data, and the method further comprises:
respectively calculating corresponding model output distribution of the original voice sample data and the enhanced voice sample data in the learning training process of performing voice wake-up classification and sequence alignment labeling on the wake-up model;
calculating a KL divergence loss based on the model output distributions corresponding to the original voice sample data and the enhanced voice sample data;
training the wake model based on the KL divergence loss.
According to a second aspect, the present invention also provides a voice wake-up method, including:
performing voice wake-up classification on the target voice input by using the target voice wake-up model obtained by training the voice wake-up model training method according to any one of the first aspect and the optional implementation manner thereof to obtain a first wake-up word;
performing sequence alignment labeling on a target voice input by using a target voice wake model obtained by training the voice wake model training method according to any one of the first aspect and optional implementation manners thereof;
comparing whether the sequence alignment labeling result contains the first wake-up word or not;
when the sequence alignment labeling result contains the first wake-up word, inputting the output characteristics of the target voice wake-up model encoder corresponding to the wake-up word position in the sequence alignment labeling result into a decoder for phoneme recognition;
comparing whether the phonemes, silence class, and other-pronunciation class contained in the wake-up word in the phoneme recognition result are consistent with the first wake-up word;
and when the phonemes, silence class, and other-pronunciation class contained in the wake-up word in the phoneme recognition result are consistent with the first wake-up word, performing a wake-up operation on the target object based on the first wake-up word.
In this manner, a voice wake-up model obtained by joint training on different tasks is used, and a multi-stage wake-up flow combines the advantages of the different tasks, so that false wake-up can be controlled at minimal false-rejection cost. The false rejection rate and the false wake-up rate are both greatly reduced and balanced against each other, further improving the user experience.
With reference to the second aspect, in a first embodiment of the second aspect, the wake-up operation on the target object is refused when the voice wake-up classification result is the non-wake-up-word category, when the sequence alignment labeling result does not contain the first wake-up word, or when the phonemes, silence class, and other-pronunciation class contained in the wake-up word in the phoneme recognition result are inconsistent with the first wake-up word.
According to a third aspect, the present invention also provides a voice wake model training device, the device comprising:
an acquisition unit, configured to acquire voice sample data, wherein the voice sample data carries labels related to wake-up words;
and the training unit is used for synchronously carrying out voice wake-up classification, sequence alignment labeling and phoneme recognition on the wake-up model formed by the encoder and the decoder based on the voice sample data to obtain a target voice wake-up model.
With reference to the third aspect, in a first embodiment of the third aspect, the training unit includes:
the first input unit is used for inputting the voice sample data into the encoder to obtain output characteristics;
the first training unit is configured to average the output characteristics in a time sequence dimension, input the averaged output characteristics into a layer of fully connected network to perform a first classification, and perform supervised training by using labels related to wake-up words in the voice sample data, where the classification of the first classification includes: all wake words and a category representing non-wake words.
With reference to the third aspect, in a second embodiment of the third aspect, the training unit includes:
the second input unit is used for inputting the voice sample data into the encoder to obtain output characteristics;
the second training unit is configured to input the output feature into a layer of fully connected network to perform a second classification, and perform supervised training by using labels related to wake-up words in the voice sample data, where the second classification includes: phonemes, silence classes, other pronunciation classes contained in the wake-up word.
With reference to the third aspect, in a third embodiment of the third aspect, the training unit includes:
a third input unit for inputting the voice sample data to the encoder to obtain an output characteristic;
and a third training unit, configured to input the output feature into the decoder for acoustic unit integration and third classification, and perform supervised training using labels related to wake-up words in the speech sample data, where the third classification includes: phonemes, silence classes, other pronunciation classes contained in the wake-up word.
With reference to the third aspect, in a fourth embodiment of the third aspect, the training unit further includes:
the first calculation unit is used for calculating corresponding model output distribution of the original voice sample data and the enhanced voice sample data in the learning training process of performing voice wake-up classification and sequence alignment labeling on the wake-up model respectively;
a second calculation unit, configured to calculate a KL divergence loss based on the model output distributions corresponding to the original voice sample data and the enhanced voice sample data;
and the fourth training unit is used for training the wake-up model based on the KL divergence loss.
According to a fourth aspect, the present invention also provides a voice wake-up device, the device comprising:
the voice wake-up classification unit is used for performing voice wake-up classification on the target voice input by using the target voice wake-up model obtained by training the voice wake-up model training method in any one of the first aspect and the optional implementation manners thereof to obtain a first wake-up word;
the sequence alignment labeling unit is used for performing sequence alignment labeling on the target voice input by adopting the target voice wake-up model obtained by training the voice wake-up model training method according to any one of the first aspect and the optional implementation manners thereof;
the first comparison unit is used for comparing whether the sequence alignment labeling result contains the first wake-up word or not;
the phoneme recognition unit is used for inputting the output characteristics of the target voice awakening model encoder corresponding to the awakening word position in the sequence alignment labeling result into the decoder for phoneme recognition when the first awakening word is contained in the sequence alignment labeling result;
the second comparison unit is used for comparing whether phonemes, silence classes and other pronunciation classes contained in the wake-up words in the phoneme recognition result are consistent with the first wake-up words or not;
and the awakening unit is used for carrying out awakening operation on the target object based on the first awakening word when the phonemes, the mute class and other pronunciation classes contained in the awakening word in the phoneme recognition result are consistent with the first awakening word.
With reference to the fourth aspect, in a first embodiment of the fourth aspect, the apparatus further includes:
and the refusing unit is used for refusing the awakening operation on the target object when the voice awakening classification result is a non-awakening word category, or when the sequence alignment labeling result does not contain the first awakening word, or when phonemes, silence classes and other pronunciation classes contained in the awakening word in the phoneme recognition result are inconsistent with the first awakening word.
According to a fifth aspect, the present invention further provides computer equipment comprising a memory and a processor communicatively connected to each other, the memory storing computer instructions, and the processor executing the computer instructions to perform the voice wake-up model training method of any one of the first aspect and its alternative embodiments, or the voice wake-up method of any one of the second aspect and its alternative embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show some embodiments of the present invention, and that a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flowchart of a method for training a voice wake model in accordance with an exemplary embodiment.
Fig. 2 is a flow chart of a voice wakeup method according to an example embodiment.
Fig. 3 is a block diagram of a voice wake model training apparatus according to an exemplary embodiment.
Fig. 4 is a block diagram of a voice wake-up device according to an exemplary embodiment.
Fig. 5 is a schematic diagram of a hardware structure of a computer device according to an exemplary embodiment.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the related art, the main difficulty of voice wake-up technology is balancing false wake-up against false rejection, that is, improving the wake-up effect while keeping the false wake-up rate low; to achieve a low false wake-up rate, current mainstream schemes usually sacrifice the wake-up rate and thereby increase false rejections.
In order to solve the above problems, an embodiment of the present invention provides a voice wake-up model training method for computer equipment. It should be noted that the execution body of the method may be a voice wake-up model training device, which may be implemented as part or all of the computer equipment by software, hardware, or a combination of the two. The computer equipment may be a terminal, a client, or a server, and the server may be a single server or a server cluster composed of multiple servers. In the following method embodiments, the execution subject is the computer equipment.
The computer equipment in this embodiment is suited to voice wake-up scenarios. The voice wake-up model training method provided by the invention uses multi-task learning, so that the wake-up model synchronously learns different kinds of information useful for wake-up, and the kinds of information complement one another. This complementarity improves the accuracy of the voice wake-up model; through the three-stage wake-up flow, false rejection and false wake-up are avoided to the greatest extent, so that the voice wake-up model balances false wake-up against false rejection, false wake-ups are minimized while a high wake-up rate is maintained, and the user experience is improved.
FIG. 1 is a flowchart of a method for training a voice wake model in accordance with an exemplary embodiment. As shown in fig. 1, the voice wake model training method includes the following steps S101 to S102.
In step S101, voice sample data is acquired.
In the embodiment of the invention, the voice sample data carries labels related to the wake-up word; the labels may include wake-word labels, phoneme labels, a silence class, an other-pronunciation class, and the like.
In step S102, the wake-up model formed by the encoder and the decoder is synchronously subjected to the multi-task learning training of voice wake-up classification, sequence alignment labeling and phoneme recognition based on the voice sample data, so as to obtain a target voice wake-up model.
In the embodiment of the invention, after the voice sample data are received, a multi-stage voice wake-up model is trained by multi-task learning in order to wake the target object accurately. The wake-up model is composed of an encoder and a decoder. In an example, the backbone of the wake-up model is divided into two parts, an encoder and a decoder: the encoder is a 12-layer gMLP model, and the decoder is composed of a continuous integrate-and-fire (CIF) module and a one-layer Transformer decoder.
In an embodiment of the present invention, learning training for performing a voice wake-up classification on a wake-up model composed of an encoder and a decoder based on voice sample data includes: inputting the voice sample data into an encoder to obtain output characteristics; after averaging the output characteristics in the time sequence dimension, inputting the averaged output characteristics into a layer of fully-connected network for first classification, and performing supervised training by utilizing labels related to wake-up words in voice sample data, wherein the categories of the first classification comprise: all wake words and a category representing non-wake words.
In one example, after inputting the speech sample data into the encoder to obtain the output features, performing the speech wake up classification learning task may include: and averaging the output characteristics of the voice sample obtained by the encoder in a time sequence dimension, inputting the averaged output characteristics of the voice sample into a layer of fully-connected network, performing supervised training on the wake-up model by adopting cross entropy loss, and obtaining a first classification category comprising all wake-up words and a category representing non-wake-up words.
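The first-stage classification head described above (time-averaged encoder features fed into a one-layer fully connected network, trained with a cross-entropy loss) can be sketched as follows. This is an illustrative NumPy sketch rather than code from the patent; the function names, tensor shapes, and the plain linear-plus-softmax head are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def wake_classification_head(encoder_out, W, b):
    """First-stage wake classification: average the encoder features over
    the time-sequence dimension, then apply one fully connected layer.

    encoder_out: (T, D) frame-level encoder features.
    W: (D, C) weights, b: (C,) bias, where C = number of wake words + 1
       (the extra class represents "non-wake-word").
    Returns a probability distribution over the C classes.
    """
    pooled = encoder_out.mean(axis=0)   # average over the time dimension
    logits = pooled @ W + b             # one-layer fully connected network
    return softmax(logits)

def cross_entropy(probs, label):
    # Supervised loss against the wake-word label of the sample.
    return float(-np.log(probs[label] + 1e-12))
```

A whole utterance thus collapses to a single class decision, which is what makes this head fast enough for a first rejection stage.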
In the embodiment of the invention, the learning training for performing sequence alignment labeling on the wake-up model formed by the encoder and the decoder based on the voice sample data comprises the following steps: inputting the voice sample data into the encoder to obtain output features; inputting the output features into a one-layer fully connected network to perform a second classification, and performing supervised training using the labels related to wake-up words in the voice sample data, wherein the categories of the second classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.
In one example, after inputting the speech sample data into the encoder to obtain the output features, performing the sequence alignment annotation learning task may include: and inputting the voice sample output characteristics obtained through the encoder into a layer of full-connection network to carry out second classification, and carrying out supervised training on the wake-up model by adopting cross entropy loss on the result of each frame of the voice sample obtained by inputting the full-connection network, wherein the categories of the second classification comprise phonemes, silence classes and other pronunciation classes contained in the wake-up word.
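The alignment task reuses the same encoder output but classifies every frame independently, with a cross-entropy loss averaged over frames. A minimal sketch, assuming NumPy arrays of shape (T, D) and hypothetical parameter names:

```python
import numpy as np

def frame_softmax(logits):
    # Stable softmax over the class axis, applied independently per frame.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def alignment_head(encoder_out, W, b):
    """Second task head: classify every encoder output frame into the
    categories {phonemes of the wake word, silence, other pronunciation}.
    encoder_out: (T, D) -> returns (T, C) per-frame class distributions."""
    return frame_softmax(encoder_out @ W + b)

def alignment_loss(frame_probs, frame_labels):
    # Cross entropy of each frame against its alignment label, averaged.
    idx = np.arange(len(frame_labels))
    return float(-np.log(frame_probs[idx, frame_labels] + 1e-12).mean())
```

Because the output is per-frame, decoding this head (e.g. with beam search, as in the second wake-up stage) yields both the phoneme sequence and the position of the wake word in time.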
In an embodiment of the present invention, learning training for phoneme recognition of a wake-up model constituted by an encoder and a decoder based on speech sample data includes: inputting the voice sample data into an encoder to obtain output characteristics; the output features are input into the decoder for acoustic unit integration and third classification, and supervised training is performed by using labels related to wake-up words in voice sample data, wherein the third classification comprises: phonemes, silence classes, other pronunciation classes contained in the wake-up word.
In one example, after inputting the voice sample data into the encoder to obtain the output features, performing the phoneme recognition learning task may include: inputting the voice sample output features obtained through the encoder into the continuous integrate-and-fire (CIF) module to obtain integrated acoustic-unit features; inputting the acoustic-unit features into the Transformer decoder for prediction to obtain the phoneme labels corresponding to the acoustic-unit features; and performing supervised training with a cross-entropy loss to obtain the loss of the phoneme recognition task, wherein the categories comprise the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.
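The continuous integrate-and-fire step can be illustrated with a short sketch. The accumulate-until-threshold-then-fire loop below follows the published CIF idea in simplified form; the threshold value, the dropping of the trailing remainder, and the assumption that each per-frame weight does not exceed the threshold are simplifications of this sketch, not details from the patent.

```python
import numpy as np

def cif_integrate(frames, weights, threshold=1.0):
    """Continuous integrate-and-fire (CIF): accumulate per-frame weights;
    each time the accumulator reaches the threshold, emit ("fire") one
    integrated acoustic-unit feature, the weighted sum of the frames
    consumed since the previous firing.

    frames: (T, D) encoder outputs; weights: (T,) non-negative weights,
    each assumed <= threshold. Returns (U, D) acoustic-unit features."""
    units = []
    acc = 0.0
    feat = np.zeros(frames.shape[1])
    for h, a in zip(frames, weights):
        if acc + a < threshold:
            acc += a
            feat += a * h
        else:
            spill = threshold - acc      # portion completing this unit
            units.append(feat + spill * h)
            acc = a - spill              # remainder starts the next unit
            feat = acc * h
    return np.array(units) if units else np.zeros((0, frames.shape[1]))
```

The emitted acoustic-unit features would then be fed to the Transformer decoder, which predicts one phoneme label per unit.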
Through the embodiment, the information which is different and is useful for waking up can be synchronously learned by using the multi-task training learning, so that the information complementation is further realized. By complementing the information useful for waking up, the accuracy of the voice wake-up model is improved.
In one embodiment, to make the target voice wake-up model robust, the model is additionally trained with a self-distillation task. Self-distillation training lets the target voice wake-up model adapt to noise, improves its anti-interference capability and robustness, and thus improves its wake-up performance in noisy environments. In one example, the enhanced voice sample data may be obtained by performing voice enhancement processing on the training sample audio, with enhancement methods including volume perturbation, pitch perturbation, noise addition, reverberation addition, and the like.
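Two of the enhancement methods named above, volume perturbation and noise addition, are simple enough to sketch directly; pitch perturbation and reverberation normally rely on a DSP library and are omitted here. The SNR target and gain range are illustrative assumptions of this sketch:

```python
import numpy as np

def augment_waveform(wav, rng, snr_db=20.0, gain_range=(0.5, 1.5)):
    """Illustrative voice enhancement: apply a random gain (volume
    perturbation), then add white noise scaled to a target SNR in dB.

    wav: 1-D float waveform; rng: numpy.random.Generator."""
    out = wav * rng.uniform(*gain_range)        # volume perturbation
    noise = rng.normal(size=wav.shape)
    sig_power = np.mean(out ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that sig_power / (scale^2 * noise_power) = 10^(snr/10).
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return out + scale * noise                  # noise addition
```

Each original training utterance and its augmented copy then form the input pair for the self-distillation loss described next.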
In an embodiment of the present invention, the self-distillation task training method may include: respectively calculating corresponding model output distribution of the original voice sample data and the enhanced voice sample data in the learning training process of performing voice wake-up classification and sequence alignment labeling on the wake-up model; calculating KL divergence loss based on model output distribution corresponding to original voice sample data and enhanced voice sample data distribution; the wake model is trained based on KL divergence loss.
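The self-distillation consistency loss can be sketched as a KL divergence between the model's output distributions on the clean and enhanced versions of the same sample. Whether a one-sided or symmetric KL is used is not stated in the text; the symmetric form below is one reasonable assumption:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for rows of probability vectors, averaged over rows.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float((p * np.log(p / q)).sum(axis=-1).mean())

def self_distillation_loss(probs_clean, probs_augmented):
    """Consistency loss between the model output distributions for the
    original sample and its enhanced (noisy) copy; minimizing it pushes
    the model toward noise-robust predictions.  The symmetric average of
    the two KL directions is used here as one reasonable choice."""
    return 0.5 * (kl_divergence(probs_clean, probs_augmented)
                  + kl_divergence(probs_augmented, probs_clean))
```

In training, this loss would be added to the supervised losses of the wake-classification and alignment tasks computed on both copies of the sample.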
Fig. 2 is a flow chart of a voice wakeup method according to an example embodiment. As shown in fig. 2, the voice wakeup method includes the following steps.
In step S201, performing voice wake-up classification on a target voice input by using the target voice wake-up model obtained by training by using any voice wake-up model training method in the above embodiment to obtain a first wake-up word;
in step S202, inputting the target voice into the target voice wake-up model trained by any voice wake-up model training method in the above embodiments to perform sequence alignment labeling;
in step S203, whether the sequence alignment labeling result includes a first wake-up word is compared;
in step S204, when the sequence alignment labeling result contains the first wake-up word, inputting the output features of the target voice wake-up model encoder corresponding to the wake-up word position in the sequence alignment labeling result into the decoder for phoneme recognition;
in step S205, comparing whether the phonemes, silence class, and other-pronunciation class contained in the wake-up word in the phoneme recognition result are consistent with the first wake-up word;
in step S206, when the phonemes, the silence class, and the other pronunciation classes included in the wake-up word in the phoneme recognition result are consistent with the first wake-up word, a wake-up operation is performed on the target object based on the first wake-up word.
In the embodiment of the invention, when the voice wake-up classification result is a non-wake-up word category, or when the sequence alignment labeling result does not contain a first wake-up word, or when phonemes, silence classes, other pronunciation classes contained in the wake-up word in the phoneme recognition result are inconsistent with the first wake-up word, wake-up operation is refused to be performed on the target object.
In one example, the voice wakeup method may be divided into three phases, including:
the target voice wake-up model obtained based on training by the voice wake-up model training method is fast in rejection, and is characterized by being fast and efficient, free of decoding and post-processing processes, high in wake-up rate and high in false wake-up. The specific implementation method comprises the following steps: obtaining a wake-up category corresponding to the voice sample data through voice wake-up classification learning training, and if the wake-up category is one of wake-up words, carrying out corresponding wake-up; otherwise, rejecting.
In the second stage, sequence alignment labeling is performed with the target voice wake-up model, and a beam search algorithm is used for decoding to obtain the phoneme sequence corresponding to the voice sequence. If the phoneme sequence contains the wake-up word, the start and end positions of the wake-up word's pronunciation are recorded and the third stage is entered; otherwise, the input is rejected.
In the third stage, the encoder output features corresponding to the wake-up word position found in the second stage are input into the phoneme recognition task module, and a beam search algorithm is used for decoding to obtain the final, accurate phoneme sequence at that position. If the obtained phoneme sequence contains the wake-up word, the final wake-up is carried out; otherwise, the input is rejected.
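Putting the three stages together, the overall decision logic might look like the sketch below. The three callables stand in for the shared model's task heads (fast classification, alignment decoding, phoneme recognition) and are hypothetical interfaces of this sketch, not APIs from the patent:

```python
def staged_wake_decision(classify_fn, align_fn, phoneme_fn, audio):
    """Three-stage wake-up flow.

    Stage 1: fast classification - reject immediately on non-wake-word.
    Stage 2: sequence alignment - the decoded phoneme sequence must
             contain the wake word; its start/end span is recorded.
    Stage 3: phoneme recognition on the located span must confirm the
             same wake word before the device is finally woken up.

    classify_fn(audio) -> wake word or None;
    align_fn(audio, word) -> (start, end) span or None;
    phoneme_fn(audio, span) -> recognized word for that span."""
    wake_word = classify_fn(audio)               # stage 1
    if wake_word is None:
        return False
    span = align_fn(audio, wake_word)            # stage 2
    if span is None:
        return False
    return phoneme_fn(audio, span) == wake_word  # stage 3
```

Each stage only runs if the cheaper previous stage passed, which is how the flow controls false wake-ups at minimal false-rejection cost.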
Through the above embodiment, a voice wake-up model obtained by joint training on different tasks is used, and the multi-stage wake-up flow combines the advantages of the different tasks, so that false wake-up can be controlled at minimal false-rejection cost: the false wake-up rate is greatly reduced, false rejection and false wake-up are balanced, and the user experience is further improved.
Based on the same inventive concept, the invention also provides a voice wake-up model training device.
Fig. 3 is a block diagram of a voice wake-up model training apparatus according to an exemplary embodiment. As shown in Fig. 3, the voice wake-up model training apparatus includes an acquisition unit 301 and a training unit 302.
The acquisition unit 301 is configured to acquire voice sample data, where the voice sample data carries labels related to the wake-up word.
The training unit 302 is configured to synchronously perform multitask learning training of voice wake-up classification, sequence alignment labeling, and phoneme recognition on the wake-up model formed by the encoder and the decoder based on the voice sample data, to obtain the target voice wake-up model.
In one embodiment, the training unit 302 includes: a first input unit, configured to input the voice sample data into the encoder to obtain output features; and a first training unit, configured to average the output features over the time dimension, input the result into a one-layer fully-connected network for the first classification, and perform supervised training using the wake-up-word-related labels in the voice sample data, where the categories of the first classification include all wake-up words and one category representing non-wake-up words.
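The first-classification branch described above can be sketched as a small PyTorch module. The feature dimension and number of wake words are placeholder values, and the single linear layer is an assumption consistent with "a one-layer fully-connected network":

```python
import torch
import torch.nn as nn

class WakeClassificationHead(nn.Module):
    """Hypothetical first-classification head: average encoder features over
    the time axis, then one fully-connected layer whose classes are all wake
    words plus one non-wake-word class."""

    def __init__(self, feat_dim: int, num_wake_words: int):
        super().__init__()
        # +1 output for the non-wake-word category
        self.fc = nn.Linear(feat_dim, num_wake_words + 1)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, time, feat_dim) -> average over the time dimension
        pooled = enc_out.mean(dim=1)    # (batch, feat_dim)
        return self.fc(pooled)          # (batch, num_wake_words + 1)

head = WakeClassificationHead(feat_dim=256, num_wake_words=3)
logits = head(torch.randn(2, 100, 256))  # (2, 4): 3 wake words + 1 non-wake
```

The time-axis averaging collapses the utterance to a single vector, which is why this branch needs no decoding at inference time.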
In another embodiment, the training unit 302 includes: a second input unit, configured to input the voice sample data into the encoder to obtain output features; and a second training unit, configured to input the output features into a one-layer fully-connected network for the second classification, and perform supervised training using the wake-up-word-related labels in the voice sample data, where the categories of the second classification include the phonemes contained in the wake-up word, a silence class, and an "other pronunciation" class.
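The second-classification (sequence alignment) branch differs from the first only in skipping the time pooling, so every frame gets a label. A minimal sketch, with placeholder dimensions:

```python
import torch
import torch.nn as nn

class AlignmentHead(nn.Module):
    """Hypothetical second-classification head: one fully-connected layer
    applied to every encoder frame (no time pooling), so the per-frame labels
    keep the alignment between audio and phonemes. Classes: the wake word's
    phonemes, a silence class, and an 'other pronunciation' class."""

    def __init__(self, feat_dim: int, num_phonemes: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_phonemes + 2)  # + silence + other

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, time, feat_dim) -> (batch, time, num_phonemes + 2)
        return self.fc(enc_out)

head = AlignmentHead(feat_dim=256, num_phonemes=5)
frame_logits = head(torch.randn(2, 100, 256))  # (2, 100, 7)
```

Because the logits are per frame, decoding them (e.g. with beam search, as in the second wake-up stage) yields both the phoneme sequence and the frame positions of the wake word.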
In yet another embodiment, the training unit 302 includes: a third input unit, configured to input the voice sample data into the encoder to obtain output features; and a third training unit, configured to input the output features into the decoder for acoustic unit integration and the third classification, and perform supervised training using the wake-up-word-related labels in the voice sample data, where the categories of the third classification include the phonemes contained in the wake-up word, a silence class, and an "other pronunciation" class.
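Putting the three branches together, the shared-encoder multitask model can be sketched as follows. The GRU encoder/decoder choice and all dimensions are assumptions; the patent does not fix the architecture, only the three task branches:

```python
import torch
import torch.nn as nn

class WakeMultitaskModel(nn.Module):
    """Minimal sketch of a shared encoder with the three task branches:
    (1) time-pooled wake classification, (2) per-frame sequence alignment,
    (3) a decoder over encoder features that integrates acoustic units into
    phonemes for the third classification."""

    def __init__(self, feat_dim=40, hidden=128, num_wake=3, num_phone=60):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls_head = nn.Linear(hidden, num_wake + 1)       # task 1
        self.align_head = nn.Linear(hidden, num_phone + 2)    # task 2
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.phone_head = nn.Linear(hidden, num_phone + 2)    # task 3

    def forward(self, x: torch.Tensor):
        enc, _ = self.encoder(x)                     # (B, T, H)
        cls_logits = self.cls_head(enc.mean(dim=1))  # (B, num_wake + 1)
        align_logits = self.align_head(enc)          # (B, T, num_phone + 2)
        dec, _ = self.decoder(enc)                   # decoder over enc feats
        phone_logits = self.phone_head(dec)          # (B, T, num_phone + 2)
        return cls_logits, align_logits, phone_logits

model = WakeMultitaskModel()
c, a, p = model(torch.randn(2, 50, 40))  # batch of 2, 50 frames, 40-dim feats
```

Training would sum the three supervised losses so the shared encoder learns all wake-relevant information at once, which is the complementarity the abstract describes.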
In yet another embodiment, the training unit 302 further includes: a first calculation unit, configured to calculate the model output distributions corresponding to the original voice sample data and the enhanced voice sample data during the learning training of voice wake-up classification and sequence alignment labeling on the wake-up model; a second calculation unit, configured to calculate a KL divergence loss based on the model output distributions corresponding to the original voice sample data and the enhanced voice sample data; and a fourth training unit, configured to train the wake-up model based on the KL divergence loss.
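The KL divergence consistency term can be sketched as below. The patent only specifies a KL divergence loss between the two output distributions; the symmetric form used here is one common choice and is an assumption:

```python
import torch
import torch.nn.functional as F

def symmetric_kl_loss(logits_orig: torch.Tensor,
                      logits_aug: torch.Tensor) -> torch.Tensor:
    """Consistency loss between the model's output distributions on an
    original sample and its enhanced (augmented) copy. Averaged symmetric
    KL divergence; lower means the two distributions agree more."""
    p = F.log_softmax(logits_orig, dim=-1)
    q = F.log_softmax(logits_aug, dim=-1)
    # F.kl_div(input, target) computes KL(target || input); log_target=True
    # says the target is already in log space.
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

# Identical logits -> zero loss; diverging logits -> positive loss.
same = symmetric_kl_loss(torch.zeros(2, 5), torch.zeros(2, 5))
```

During training this term would be added to the supervised task losses, pushing the model to give the same prediction on clean and enhanced versions of the same utterance.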
For the specific limitations and beneficial effects of the above voice wake-up model training apparatus, see the limitations of the voice wake-up model training method above; they are not repeated here. Each of the above modules may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
Fig. 4 is a block diagram of a voice wake-up apparatus according to an exemplary embodiment. As shown in Fig. 4, the voice wake-up apparatus includes a voice wake-up classification unit 401, a sequence alignment labeling unit 402, a first comparison unit 403, a phoneme recognition unit 404, a second comparison unit 405, and a wake-up unit 406.
The voice wake-up classification unit 401 is configured to perform voice wake-up classification on the input target voice using the target voice wake-up model obtained by training with any of the above voice wake-up model training methods, to obtain a first wake-up word.
The sequence alignment labeling unit 402 is configured to input the target voice into the target voice wake-up model obtained by training with any of the above voice wake-up model training methods, to perform sequence alignment labeling.
a first comparing unit 403, configured to compare whether the sequence alignment labeling result includes a first wake-up word;
The phoneme recognition unit 404 is configured to, when the sequence alignment labeling result contains the first wake-up word, input the output features of the target voice wake-up model's encoder corresponding to the wake-up word positions in the sequence alignment labeling result into the decoder for phoneme recognition.
The second comparison unit 405 is configured to compare whether the phonemes, silence class, and other pronunciation class contained in the wake-up word in the phoneme recognition result are consistent with the first wake-up word.
The wake-up unit 406 is configured to perform a wake-up operation on the target object based on the first wake-up word when the phonemes, silence class, and other pronunciation class contained in the wake-up word in the phoneme recognition result are consistent with the first wake-up word.
In an embodiment, the apparatus further comprises: and the refusing unit is used for refusing to wake up the target object when the voice wake-up classification result is a non-wake-up word category, or when the sequence alignment labeling result does not contain the first wake-up word, or when the phonemes, the silence class and other pronunciation classes contained in the wake-up word in the phoneme recognition result are inconsistent with the first wake-up word.
For the specific limitations and beneficial effects of the above voice wake-up apparatus, see the limitations of the voice wake-up method above; they are not repeated here. Each of the above modules may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
Fig. 5 is a schematic diagram of a hardware structure of a computer device according to an exemplary embodiment. As shown in Fig. 5, the device includes one or more processors 510 and a memory 520, where the memory 520 includes persistent memory, volatile memory, and a hard disk; one processor 510 is taken as an example in Fig. 5. The device may further include an input device 530 and an output device 540.
The processor 510, memory 520, input device 530, and output device 540 may be connected by a bus or by other means; a bus connection is taken as an example in Fig. 5.
The processor 510 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the above chip types. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 520 is a non-transitory computer-readable storage medium, including persistent memory, volatile memory, and a hard disk, and may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice wake-up model training method and voice wake-up method in the embodiments of the present application. The processor 510 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 520, i.e., implements any of the voice wake-up model training methods and voice wake-up methods described above.
The memory 520 may include a storage program area and a storage data area, where the storage program area may store an operating system and the application program required by at least one function, and the storage data area may store data created according to the use of the device, and the like. In addition, the memory 520 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510, connected to the data processing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control. The output device 540 may include a display device such as a display screen.
One or more modules are stored in memory 520 that, when executed by one or more processors 510, perform the methods illustrated in fig. 1-2.
The above product can execute the methods provided by the embodiments of the invention and has the functional modules and beneficial effects corresponding to those methods. Technical details not described in this embodiment can be found in the embodiments shown in Figs. 1-2.
The embodiment of the invention also provides a non-transitory computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the voice wake-up model training method or the voice wake-up method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (Flash Memory), a hard disk (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); the storage medium may also comprise a combination of the above kinds of memories.
It is apparent that the above embodiments are given by way of illustration only and are not limiting. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art; it is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (6)

1. A method for training a voice wake model, the method comprising:
acquiring voice sample data, wherein the voice sample data is provided with labels related to wake-up words;
performing multi-task learning training of voice wake-up classification, sequence alignment labeling and phoneme recognition on a wake-up model formed by an encoder and a decoder synchronously based on the voice sample data to obtain a target voice wake-up model;
the learning training of the voice wake-up classification for the wake-up model formed by the encoder and the decoder based on the voice sample data comprises the following steps: inputting the voice sample data into the encoder to obtain output characteristics; after averaging the output characteristics in a time sequence dimension, inputting the averaged output characteristics into a layer of fully-connected network to perform first classification, and performing supervised training by utilizing labels related to wake-up words in the voice sample data, wherein the classification of the first classification comprises: all wake words and a category representing non-wake words;
the learning training for performing sequence alignment labeling on a wake-up model formed by an encoder and a decoder based on the voice sample data comprises the following steps: inputting the voice sample data into the encoder to obtain output characteristics; inputting the output characteristics into a layer of fully-connected network to carry out second classification, and carrying out supervised training by utilizing labels related to wake-up words in the voice sample data, wherein the categories of the second classification comprise: a phoneme, a mute class and other pronunciation classes contained in the wake-up word;
the learning training of phoneme recognition on the wake-up model formed by the encoder and the decoder based on the voice sample data comprises the following steps: inputting the voice sample data into the encoder to obtain output characteristics; inputting the output characteristics into the decoder for acoustic unit integration and third classification, and performing supervised training by using labels related to wake-up words in the voice sample data, wherein the third classification comprises: phonemes, silence classes, other pronunciation classes contained in the wake-up word.
2. The method of claim 1, wherein the voice sample data comprises: original voice sample data and enhanced voice sample data, wherein the enhanced voice sample data is voice sample data obtained by performing voice enhancement processing on the original voice sample data, and the method further comprises:
respectively calculating corresponding model output distribution of the original voice sample data and the enhanced voice sample data in the learning training process of performing voice wake-up classification and sequence alignment labeling on the wake-up model;
calculating KL divergence loss based on model output distribution corresponding to the original voice sample data and the enhanced voice sample data distribution;
training the wake model based on the KL divergence loss.
3. A method of waking up speech, comprising:
performing voice wake-up classification on an input target voice by using a target voice wake-up model obtained by training with the voice wake-up model training method according to any one of claims 1-2, to obtain a first wake-up word;
performing sequence alignment labeling on the target voice by using the target voice wake-up model obtained by training with the voice wake-up model training method according to any one of claims 1-2;
comparing whether the sequence alignment labeling result contains the first wake-up word or not;
when the sequence alignment labeling result contains the first wake-up word, inputting the output characteristics of the target voice wake-up model encoder corresponding to the wake-up word position in the sequence alignment labeling result into a decoder for phoneme recognition;
comparing whether phonemes, mute classes and other pronunciation classes contained in the wake-up words in the phoneme recognition result are consistent with the first wake-up words or not;
when phonemes, mute classes and other pronunciation classes contained in the wake-up words in the phoneme recognition result are consistent with the first wake-up words, carrying out wake-up operation on the target object based on the first wake-up words;
and refusing to wake up the target object when the voice wake up classification result is a non-wake up word category, or when the sequence alignment labeling result does not contain the first wake up word, or when phonemes, silence classes and other pronunciation classes contained in the wake up word in the phoneme recognition result are inconsistent with the first wake up word.
4. A voice wake model training apparatus, the apparatus comprising:
the voice recognition device comprises an acquisition unit, a recognition unit and a recognition unit, wherein the acquisition unit is used for acquiring voice sample data, and labels related to wake-up words are in the voice sample data;
the training unit is used for synchronously carrying out voice awakening classification, sequence alignment labeling and phoneme recognition on the awakening model formed by the encoder and the decoder based on the voice sample data to obtain a target voice awakening model;
the learning training of the voice wake-up classification for the wake-up model formed by the encoder and the decoder based on the voice sample data comprises the following steps: inputting the voice sample data into the encoder to obtain output characteristics; after averaging the output characteristics in a time sequence dimension, inputting the averaged output characteristics into a layer of fully-connected network to perform first classification, and performing supervised training by utilizing labels related to wake-up words in the voice sample data, wherein the classification of the first classification comprises: all wake words and a category representing non-wake words;
the learning training for performing sequence alignment labeling on a wake-up model formed by an encoder and a decoder based on the voice sample data comprises the following steps: inputting the voice sample data into the encoder to obtain output characteristics; inputting the output characteristics into a layer of fully-connected network to carry out second classification, and carrying out supervised training by utilizing labels related to wake-up words in the voice sample data, wherein the categories of the second classification comprise: a phoneme, a mute class and other pronunciation classes contained in the wake-up word;
the learning training of phoneme recognition on the wake-up model formed by the encoder and the decoder based on the voice sample data comprises the following steps: inputting the voice sample data into the encoder to obtain output characteristics; inputting the output characteristics into the decoder for acoustic unit integration and third classification, and performing supervised training by using labels related to wake-up words in the voice sample data, wherein the third classification comprises: phonemes, silence classes, other pronunciation classes contained in the wake-up word.
5. A voice wakeup apparatus, the apparatus comprising:
the voice wake-up classification unit is used for performing voice wake-up classification on an input target voice by using a target voice wake-up model obtained by training with the voice wake-up model training method according to any one of claims 1-2, to obtain a first wake-up word;
a sequence alignment labeling unit, configured to perform sequence alignment labeling on the target voice by using the target voice wake-up model obtained by training with the voice wake-up model training method according to any one of claims 1-2;
the first comparison unit is used for comparing whether phonemes, silence classes and other pronunciation classes contained in the wake-up words in the sequence alignment labeling result are consistent with the first wake-up words or not;
the phoneme recognition unit is used for inputting the output characteristics of the target voice awakening model encoder corresponding to the awakening word position in the sequence alignment labeling result into the decoder for phoneme recognition when the phonemes, the mute class and other pronunciation classes contained in the awakening word in the sequence alignment labeling result are consistent with the first awakening word;
the second comparison unit is used for comparing whether phonemes, silence classes and other pronunciation classes contained in the wake-up words in the phoneme recognition result are consistent with the first wake-up words or not;
the wake-up unit is used for carrying out wake-up operation on the target object based on the first wake-up word when phonemes, silence classes and other pronunciation classes contained in the wake-up word in the phoneme recognition result are consistent with the first wake-up word;
and the refusing unit is used for refusing the awakening operation on the target object when the voice awakening classification result is a non-awakening word category, or when the sequence alignment labeling result does not contain the first awakening word, or when phonemes, silence classes and other pronunciation classes contained in the awakening word in the phoneme recognition result are inconsistent with the first awakening word.
6. A computer device comprising a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the voice wake model training method of any of claims 1-2 or to perform the voice wake method of claim 3.
CN202211481741.7A 2022-11-24 2022-11-24 Voice awakening model training and voice awakening method and device and computer equipment Active CN115862604B (en)

Publications (2)

Publication Number Publication Date
CN115862604A CN115862604A (en) 2023-03-28
CN115862604B true CN115862604B (en) 2024-02-20






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant