CN111312222A - Awakening and voice recognition model training method and device

Info

Publication number: CN111312222A
Application number: CN202010091382.9A
Authority: CN
Inventors: 陈天峰; 冯大航; 陈孝良; 常乐
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2020-02-13
Filing date: 2020-02-13
Publication date: 2020-06-19
Anticipated expiration: 2040-02-13
Also published as: CN111312222B

Abstract

The application relates to the technical field of computers, and in particular to a wake-up method and a voice recognition model training method and device. The wake-up method comprises: acquiring a wake-up voice; using a trained voice recognition model, with the wake-up voice as an input parameter, recognizing whether the wake-up voice contains a preset wake-up word, and obtaining a probability score of whether the wake-up voice contains the preset wake-up word, wherein the voice recognition model is obtained through iterative training on a voice sample set, the voice sample set comprises at least a target wake-up word voice sample of a target user, and the target user is a VIP user; and if the probability score is greater than or equal to a preset probability score threshold, determining to wake up. In this way, the effect of the target user waking up the intelligent device can be improved.

Description

Awakening and voice recognition model training method and device

Technical Field

The application relates to the technical field of computers, in particular to a method and a device for training a wake-up and voice recognition model.

Background

At present, with the continuous development of artificial intelligence technology, more and more intelligent devices, such as smart speakers, are appearing. When a user needs to use an intelligent device, the device must first be woken up from a sleep state before it can be used. Because the wake-up performance of existing intelligent devices is tuned mainly for the majority of people, a specific person whose tone or timbre is unusual may be unable to wake the device, or may have to input the wake-up voice many times before the device wakes up, which greatly affects the user experience. How to improve the effect of a target user waking up the intelligent device has therefore become a problem to be solved urgently.

Disclosure of Invention

The embodiment of the application provides a method and a device for awakening and speech recognition model training, so as to improve the awakening effect of a target user on intelligent equipment.

The embodiment of the application provides the following specific technical scheme:

a method of waking up, comprising:

acquiring a wake-up voice;

according to a trained voice recognition model, with the awakening voice as an input parameter, recognizing whether the awakening voice contains a preset awakening word or not, and obtaining a probability score of whether the awakening voice contains the preset awakening word or not, wherein the voice recognition model is obtained through iterative training according to a voice sample set, the voice sample set at least comprises a target awakening word voice sample of a target user, and the target user is a VIP user;

and if the probability score is larger than or equal to a preset probability score threshold value, determining to wake up.

Optionally, the speech recognition model is one or a combination of the following: a first speech recognition model and a second speech recognition model;

if the voice sample set comprises a general awakening word voice sample set and a target awakening word voice sample set, the voice recognition model is the first voice recognition model, the general awakening word voice sample set comprises general awakening word voice samples of a plurality of non-target users, and the target awakening word sample set comprises target awakening word voice samples of a plurality of target users;

and if the voice sample set is a target awakening word voice sample set, the voice recognition model is the second voice recognition model.

Optionally, obtaining a probability score of whether the awakening voice includes a preset awakening word specifically includes:

according to a trained first voice recognition model, recognizing whether the awakening voice contains a preset awakening word or not by taking the awakening voice as an input parameter, and obtaining a first probability score of whether the awakening voice contains the preset awakening word or not;

according to a trained second voice recognition model, recognizing whether the awakening voice contains a preset awakening word or not by taking the awakening voice as an input parameter, and obtaining a second probability score of whether the awakening voice contains the preset awakening word or not;

and obtaining the probability score of whether the awakening voice contains a preset awakening word or not according to the first probability score and the second probability score.

A method of speech recognition model training, comprising:

acquiring a voice sample set, wherein the voice sample set at least comprises a target awakening word voice sample of a target user, and the target user is a VIP user;

and inputting the voice sample set into a voice recognition model for training, and outputting a probability score of whether each recognized voice sample contains the preset wake-up word, until an objective function of the voice recognition model converges, so as to obtain the trained voice recognition model, wherein the objective function is the minimization of a cross-entropy function over the recognition results, that is, over the probability scores of whether the preset wake-up word is contained.
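The training step above can be illustrated with a minimal sketch. The application does not specify the model architecture, the features, or the convergence test, so everything concrete here is an assumption: a logistic-regression scorer stands in for the voice recognition model, each sample is a hypothetical two-dimensional feature vector with a label (1 if it contains the wake-up word, 0 otherwise), and "converged" is taken to mean that the cross-entropy loss changes by less than a small tolerance between iterations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability (0..1) that a sample contains the wake-up word."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def cross_entropy(samples, weights, bias):
    """Mean binary cross entropy over (features, label) pairs."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for features, label in samples:
        p = predict(weights, bias, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(samples)


def train(samples, learning_rate=0.5, tolerance=1e-6, max_iterations=10000):
    """Gradient descent until the loss change falls below `tolerance`."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    n = len(samples)
    previous_loss = cross_entropy(samples, weights, bias)
    loss = previous_loss
    for _ in range(max_iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in samples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
        loss = cross_entropy(samples, weights, bias)
        if abs(previous_loss - loss) < tolerance:  # treated as converged (assumption)
            break
        previous_loss = loss
    return weights, bias, loss


# Hypothetical features; label 1 = contains the preset wake-up word.
SAMPLES = [([1.0, 0.9], 1), ([0.9, 1.0], 1), ([0.1, 0.2], 0), ([0.2, 0.0], 0)]
weights, bias, final_loss = train(SAMPLES)
```

Multiplying the output of `predict` by 100 would give a score on the 0 to 100 point scale used in the examples of this application; that mapping is likewise an assumption.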

Optionally, the voice sample set includes a general wake-up word voice sample set and a target wake-up word voice sample set, wherein the general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word voice sample set comprises target wake-up word voice samples of a plurality of target users; or, the voice sample set is a target wake-up word voice sample set.

Optionally, the target wake-up word voice sample set is obtained by performing data simulation on each obtained target voice wake-up word voice sample in a preset data simulation mode, where the data simulation mode at least includes one or any combination of the following modes: changing speech speed, changing intonation and adding noise.
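A rough sketch of two of the data simulation modes named above (changing speech speed and adding noise) follows. The application does not describe how the simulation is performed, so the linear-interpolation resampling, the white Gaussian noise, the speed factors, and the 20 dB signal-to-noise ratio are all illustrative assumptions. Changing intonation is omitted because shifting pitch without changing duration needs a dedicated pitch-shifting algorithm; note that the plain resampling used here also shifts pitch as a side effect.

```python
import math
import random


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 speeds up (shorter output)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at roughly the requested signal-to-noise ratio."""
    power = sum(s * s for s in samples) / len(samples)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def simulate(samples, rng):
    """Return augmented variants of one target wake-up word recording."""
    variants = [change_speed(samples, factor) for factor in (0.9, 1.1)]
    variants.append(add_noise(samples, snr_db=20, rng=rng))
    return variants


# A synthetic 0.1 s tone at 16 kHz stands in for a real recording.
recording = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
variants = simulate(recording, random.Random(0))
```

Each original recording yields three extra samples here, which is one way a small number of recordings from a target user could be expanded into a larger target wake-up word voice sample set.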

A wake-up unit, comprising:

the acquisition module is used for acquiring the awakening voice;

the processing module is used for identifying whether the awakening voice contains a preset awakening word or not according to a trained voice recognition model by taking the awakening voice as an input parameter, and obtaining a probability score of whether the awakening voice contains the preset awakening word or not, wherein the voice recognition model is obtained through iterative training according to a voice sample set, the voice sample set at least comprises a target awakening word voice sample of a target user, and the target user is a VIP user;

and the determining module is used for determining to wake up if the probability score is determined to be greater than or equal to a preset probability score threshold value.

Optionally, when obtaining the probability score of whether the wake-up speech includes the preset wake-up word, the determining module is specifically configured to:

A speech recognition model training apparatus comprising:

an acquisition module, configured to acquire a voice sample set, wherein the voice sample set comprises at least a target wake-up word voice sample of a target user, and the target user is a VIP user;

and a training module, configured to input the voice sample set into a voice recognition model for training, and output a probability score of whether each recognized voice sample contains the preset wake-up word, until an objective function of the voice recognition model converges, so as to obtain the trained voice recognition model, wherein the objective function is the minimization of a cross-entropy function over the recognition results, that is, over the probability scores of whether the preset wake-up word is contained.

An electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above-mentioned wake-up method or speech recognition model training method when executing the program.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned speech recognition model training method.

In the embodiment of the application, a wake-up voice is acquired and input, as an input parameter, into a voice recognition model obtained through iterative training on a voice sample set, wherein the voice sample set comprises at least a target wake-up word voice sample of a target user. The model recognizes whether the wake-up voice contains a preset wake-up word and outputs a probability score of whether the wake-up voice contains the preset wake-up word. If the probability score is determined to be greater than or equal to a preset probability score threshold, it is determined to wake up.

Drawings

FIG. 1 is a flowchart of an intelligent device wake-up method in an embodiment of the present application;

FIG. 2 is a flowchart of a method for training a speech recognition model according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a wake-up apparatus according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a speech recognition model training apparatus according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

At present, with the continuous development of artificial intelligence technology, more and more intelligent devices, such as smart speakers, are appearing. When a user needs to use an intelligent device, the device must be woken up from a sleep state before it can be used. However, the wake-up performance of intelligent devices in the prior art is tuned mainly for the majority of people. For example, out of 100 people, 99 may easily wake up the intelligent device after inputting a wake-up voice, but a specific person whose tone or timbre is unusual may be unable to wake the device, or may wake it only after inputting the wake-up voice many times. The wake-up effect for such a specific user is therefore very poor, which greatly affects the user experience. How to improve the effect of the target user waking up the intelligent device is thus a problem to be solved.

In the embodiment of the application, the awakening voice is obtained, according to the voice recognition model after iterative training through the voice sample set, the awakening voice is used as an input parameter, whether the awakening voice contains a preset awakening word is recognized, and probability score of whether the awakening voice contains the preset awakening word is obtained, wherein the voice sample set at least comprises a target awakening word voice sample of a target user, and awakening is determined if the probability score is determined to be larger than or equal to a preset probability score threshold value.

Based on the foregoing embodiment, referring to fig. 1, which is a flowchart of a wake-up method in the embodiment of the present application, an execution main body of the wake-up method in the embodiment of the present application is not limited, and for example, the wake-up method may be applied to an intelligent device, a server, and the like, and the following takes the application to the intelligent device as an example, and the wake-up method specifically includes:

Step 100: Acquire the wake-up voice.

In the embodiment of the present application, when the wake-up voice is obtained, the wake-up voice of the user may be obtained through a microphone on the smart device, which is not limited in the embodiment of the present application.

The microphones may be, for example, a single microphone, a dual microphone, or multiple microphones, and the multiple microphones may also form different microphone arrays, for example, a linear microphone array, a ring microphone array, or an L microphone array, which is not limited in the embodiments of the present application.

The wake-up voice of the user may be, for example, "power on" or "HELLO," which is not limited in this embodiment of the application.

The intelligent device may be, for example, an intelligent sound box, an intelligent air conditioner, and the like, and is not limited in the embodiment of the present application.

Step 110: Using the trained voice recognition model, with the wake-up voice as an input parameter, recognize whether the wake-up voice contains a preset wake-up word, and obtain a probability score of whether the wake-up voice contains the preset wake-up word.

The voice recognition model is obtained through iterative training according to a voice sample set, the voice sample set at least comprises target awakening word voice samples of a target user, and the target user is a VIP user.

That is to say, in the embodiment of the present application, the wake-up speech input by the user is input into the trained speech recognition model, that is, the wake-up speech is used as an input parameter, and whether the wake-up speech includes the preset wake-up word is recognized, and the output parameter is a probability score of whether the wake-up speech includes the preset wake-up word.

For example, assume that the preset wake-up word in the trained speech recognition model is "on". When the wake-up speech input by the user is "on", it is input into the trained speech recognition model, which recognizes whether it contains the preset wake-up word. Since the preset wake-up word is "on", the model can recognize that the wake-up speech contains the preset wake-up word, and the probability score of whether the wake-up speech contains the preset wake-up word is obtained as, for example, 78 points.

In the embodiment of the present application, the preset wake-up word is not limited, and the probability score of whether the wake-up voice includes the preset wake-up word is also not limited.

Step 120: If the probability score is greater than or equal to the preset probability score threshold, determine to wake up.

Specifically, after the awakening voice of the user is obtained, the awakening voice is used as an input parameter according to the trained voice recognition model, whether the awakening voice contains a preset awakening word is recognized, the probability score of whether the awakening voice contains the preset awakening word is obtained, and whether the awakening is carried out is judged according to the obtained probability score.

In the embodiment of the present application, before performing step 120, a probability score of the wake-up voice needs to be obtained first, and it is determined whether to wake up the smart device according to the probability score.
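Steps 100 to 120 can be summarized in a short sketch. The function name `decide_wake`, the stub model, and the use of a list of floats to represent the wake-up voice are hypothetical; the scores of 78 and 80 points are borrowed from the examples in this description, and a real trained voice recognition model would replace the stub.

```python
from typing import Callable, Sequence, Tuple

# A scoring model maps audio samples to a probability score in points (0..100).
ScoreModel = Callable[[Sequence[float]], float]


def decide_wake(
    wake_voice: Sequence[float], model: ScoreModel, threshold: float = 80.0
) -> Tuple[bool, float]:
    """Step 110: score the wake-up voice. Step 120: compare with the threshold."""
    score = model(wake_voice)
    return score >= threshold, score


def stub_model(wake_voice: Sequence[float]) -> float:
    """Placeholder for the trained voice recognition model (always 78 points)."""
    return 78.0


# Step 100 (acquisition from a microphone) is replaced by a dummy buffer.
woke, score = decide_wake([0.0, 0.1, -0.1], stub_model, threshold=80.0)
```

With a threshold of 80 points the stub's 78-point score does not wake the device; the comparison is inclusive, so a score exactly equal to the threshold would.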

The wake-up performance of devices in the prior art is tuned mainly for the majority of people. For example, when 100 users wake up the smart device, the probability of waking it up may be as high as 99%, yet the effect may be poor for a specific user, who may need to input the wake-up voice many times to wake up the smart device.

For example, when the specific user inputs the wake-up voice "on" for the first time and the intelligent device is not woken up, this means that the trained voice recognition model, taking the wake-up voice as an input parameter, recognized whether the wake-up voice contains the preset wake-up word "on" and produced a probability score that did not reach the preset probability score threshold. When the specific user inputs the wake-up voice "on" for the second time, the device may or may not be woken up. When the specific user inputs the wake-up voice "on" for the third time, if the probability score of the wake-up voice is then greater than or equal to the preset probability score threshold, the intelligent device is woken up.

That is to say, in the prior art, because the intonation, speech rate, and timbre of different users differ, the wake-up effect of a device is not good for some users. The embodiment of the present application therefore provides a wake-up method that gives the device a better wake-up effect for target users. For example, a family usually has no more than 10 members, and what the device really needs is to achieve the best effect for those specific 10 people. The wake-up method in the embodiment of the present application thus gives the intelligent device a degree of personalization: VIP users wake up the intelligent device with good effect, while the effect for non-target users is maintained.

Because the speech recognition model is obtained by iterative training on a speech sample set that comprises at least the target wake-up word voice sample of the target user, for example the target user's voice sample of the wake-up word "on", when the target user inputs the wake-up speech "on" into the speech recognition model, the obtained probability score of whether the wake-up speech contains the preset wake-up word is higher than with the prior-art wake-up method, and the device therefore determines to wake up.

Further, the speech recognition model in the embodiment of the present application is obtained after iterative training on a speech sample set, and the speech sample set comprises at least a target wake-up word voice sample of a target user, so the speech recognition model can be further divided into at least two types:

The first type: the speech recognition model is a first speech recognition model.

If the speech sample set comprises the general wake-up word voice sample set and the target wake-up word voice sample set, the speech recognition model is the first speech recognition model.

The general awakening word voice sample set comprises general awakening word voice samples of a plurality of non-target users, and the target awakening word sample set comprises target awakening word voice samples of a plurality of target users.

In an embodiment of the present application, the voice sample set of the first voice recognition model includes a general wake-up word voice sample set and a target wake-up word voice sample set.

The general awakening word voice sample set comprises general awakening word voice samples of a plurality of non-target users, and the general awakening word voice sample set at least comprises a plurality of awakening word voice samples, such as "power on", "what is the weather today", and the like.

Moreover, each wake-up word has voice samples from a plurality of non-target users. For example, assuming that the wake-up word is "on", the general wake-up word voice sample set includes the general wake-up word voice samples "on" of a plurality of non-target users, such as a 10-year-old boy, a 7-year-old girl, a 34-year-old woman, and a 56-year-old man, which is not limited in this embodiment of the application.

The target user may be one target user or multiple target users, which is not limited in this embodiment of the application.

The target wake-up word voice sample set includes target wake-up word voice samples of a plurality of target users, that is, it covers at least a plurality of wake-up words spoken by the target users, for example, "on" and "what is the weather today", which is not limited in this embodiment of the application.

Moreover, each target wake-up word has voice samples from a plurality of target users. For example, if the wake-up word is "on", the set includes at least the voice samples of "on" input by the plurality of target users, that is, the target users input "on" a plurality of times.

When the speech recognition model is the first speech recognition model, when the step of "obtaining the probability score of whether the awakening speech includes the preset awakening word" is executed, the method specifically includes:

S1: Using the trained first voice recognition model, with the wake-up voice as an input parameter, recognize whether the wake-up voice contains a preset wake-up word, and obtain a first probability score of whether the wake-up voice contains the preset wake-up word.

The voice sample set of the first voice recognition model comprises a general wake-up word voice sample set and a target wake-up word voice sample set, the general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word voice sample set comprises target wake-up word voice samples of a plurality of target users.

Specifically, according to a trained first voice recognition model, received awakening voice of a user is used as an input parameter of the first voice recognition model, whether the awakening voice contains a preset awakening word or not is recognized through the trained first voice recognition model, and a first probability score of whether the awakening voice contains the preset awakening word or not is obtained.

For example, assume that the preset wake-up word is "small a fast boot" and the wake-up voice input into the intelligent device by the target user is "small a fast boot". The intelligent device takes the received wake-up voice as an input parameter of the trained first voice recognition model, recognizes the wake-up word "small a fast boot", and obtains a first probability score of, for example, 89 points for whether the wake-up voice contains the preset wake-up word, which is not limited in the embodiment of the present application.

In addition, in the embodiment of the present application, when the speech recognition model is the first speech recognition model, the speech sample set includes both the general wake-up word voice sample set and the target wake-up word voice sample set, so that the wake-up effect for the target user can be improved while the wake-up effect for non-target users is maintained.

S2: Using the trained second voice recognition model, with the wake-up voice as an input parameter, recognize whether the wake-up voice contains a preset wake-up word, and obtain a second probability score of whether the wake-up voice contains the preset wake-up word.

The voice sample set of the second voice recognition model at least comprises a target awakening word voice sample set, and the target awakening word sample set comprises target awakening word voice samples of a plurality of target users.

Specifically, according to the trained second voice recognition model, the received awakening voice of the user is used as an input parameter of the second voice recognition model, whether the awakening voice contains the preset awakening word or not is recognized through the trained second voice recognition model, and a second probability score of whether the awakening voice contains the preset awakening word or not is obtained.

For example, assume that the preset wake-up word is "small a fast boot" and the wake-up voice input into the intelligent device by the target user is "small a fast boot". The intelligent device takes the received wake-up voice as an input parameter of the trained second voice recognition model, recognizes the wake-up word "small a fast boot", and obtains a second probability score of, for example, 93 points for whether the wake-up voice contains the preset wake-up word, which is not limited in the embodiment of the present application.

S3: Obtain the probability score of whether the wake-up voice contains the preset wake-up word according to the first probability score and the second probability score.

Specifically, when step S3 is executed, the following two different manners may be adopted, which are examples in this application, and in this application embodiment, other manners may also be adopted to obtain a probability score of whether the wake-up speech includes the preset wake-up word, which is not limited in this application embodiment.

The first mode: obtain the probability score of whether the wake-up voice contains the preset wake-up word from the sum of the first probability score and the second probability score.

When step S3 is executed, the method specifically includes:

adding the first probability score and the second probability score to obtain the probability score of whether the wake-up voice contains the preset wake-up word.

For example, assuming that the first probability score is 89 points and the second probability score is 93 points, the probability score of whether the wake-up speech includes the preset wake-up word is 182 points, which is not limited in the embodiment of the present application.

The second mode: obtain the probability score according to weights.

When step S3 is executed, the method specifically includes:

obtaining the probability score of whether the wake-up voice contains the preset wake-up word according to the first probability score, the second probability score, and the respective weights of the first probability score and the second probability score.

For example, assuming that the first probability score is 89 points, the second probability score is 93 points, the weight of the first probability score is 0.4, and the weight of the second probability score is 0.6, the probability score of whether the wake-up voice contains the preset wake-up word is 89 × 0.4 + 93 × 0.6 = 91.4 points, which is not limited in the embodiment of the present application.
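The two ways of combining the first and second probability scores in step S3 can be written as a minimal sketch. The function names are hypothetical; the numbers are the ones used in the examples above.

```python
def fuse_by_sum(first_score: float, second_score: float) -> float:
    """First mode: the combined score is the plain sum of the two scores."""
    return first_score + second_score


def fuse_by_weight(
    first_score: float,
    second_score: float,
    first_weight: float = 0.4,
    second_weight: float = 0.6,
) -> float:
    """Second mode: the combined score is the weighted sum of the two scores."""
    return first_score * first_weight + second_score * second_weight


summed = fuse_by_sum(89, 93)
weighted = fuse_by_weight(89, 93, 0.4, 0.6)
```

A summed score can exceed 100 points, so the threshold it is compared against would presumably be chosen on the same scale; the description does not specify this. With weights that add up to 1, the weighted score stays on the scale of the individual scores.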

The second type: the speech recognition model is a second speech recognition model.

If the voice sample set is the target wake-up word voice sample set, the voice recognition model is the second voice recognition model.

The voice sample set of the second voice recognition model is the target wake-up word voice sample set, which comprises target wake-up word voice samples of a plurality of target users.
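The relationship between the composition of the voice sample set and the two model types can be sketched as follows. The `VoiceSample` record, its fields, and the `model_type` helper are hypothetical names introduced only for illustration; the rule they encode is the one stated above (general plus target samples give the first model, target samples alone give the second).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VoiceSample:
    speaker_id: str       # hypothetical identifier of the speaker
    wake_word: str        # the wake-up word spoken in the sample
    is_target_user: bool  # True for a target (VIP) user


def model_type(sample_set: List[VoiceSample]) -> str:
    """Return which model type a sample set would train."""
    has_target = any(s.is_target_user for s in sample_set)
    has_general = any(not s.is_target_user for s in sample_set)
    if not has_target:
        # The sample set must contain at least a target user's sample.
        raise ValueError("sample set lacks target wake-up word voice samples")
    return "first" if has_general else "second"


target_only = [VoiceSample("vip-1", "on", True), VoiceSample("vip-2", "on", True)]
mixed = target_only + [VoiceSample("guest-1", "on", False)]
```

Here `model_type(mixed)` yields the first model type and `model_type(target_only)` the second.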

In the embodiment of the application, when the voice recognition model is the second voice recognition model, the probability score of whether the awakening voice contains the preset awakening word or not can be directly obtained.

For example, assume that the preset wake-up word is "what is the weather today" and the wake-up voice input into the intelligent device by the target user is "what is the weather today". The intelligent device takes the received wake-up voice as an input parameter of the trained second voice recognition model and obtains a probability score of, for example, 87 points for whether the wake-up voice contains the preset wake-up word, which is not limited in the embodiment of the present application.

Further, if the main execution body of the wake-up method in the embodiment of the present application is a server, the wake-up method in the embodiment of the present application specifically includes:

S1: The intelligent device acquires voice data input by a user through the microphone.

S2: The intelligent device sends the voice data to the server.

S3: The server uses the trained voice recognition model, with the wake-up voice as an input parameter, to recognize whether the wake-up voice contains a preset wake-up word, and obtains a probability score of whether the wake-up voice contains the preset wake-up word.

S4: The server judges whether the probability score is greater than or equal to a preset probability score threshold.

S5: If the server determines that the probability score is greater than or equal to the preset probability score threshold, it generates a wake-up instruction.

S6: The server sends the wake-up instruction to the intelligent device.

S7: The intelligent device wakes up according to the received wake-up instruction.
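The server-side flow S1 to S7 can be sketched with two small classes. This is only an illustration under stated assumptions: the class names, the `"WAKE"` instruction string, and the direct method call that stands in for network transport between device and server are all hypothetical, and the model is any callable that returns a score in points.

```python
from typing import Callable, Optional, Sequence

WAKE_INSTRUCTION = "WAKE"  # hypothetical instruction format


class Server:
    def __init__(self, model: Callable[[Sequence[float]], float], threshold: float = 80.0):
        self.model = model
        self.threshold = threshold

    def handle_voice(self, voice_data: Sequence[float]) -> Optional[str]:
        score = self.model(voice_data)   # S3: score the wake-up voice
        if score >= self.threshold:      # S4: compare with the threshold
            return WAKE_INSTRUCTION      # S5/S6: generate and return the instruction
        return None


class SmartDevice:
    def __init__(self, server: Server):
        self.server = server
        self.awake = False

    def on_microphone_input(self, voice_data: Sequence[float]) -> bool:
        # S1: voice_data was captured by the microphone.
        # S2: send it to the server (a direct call stands in for the network).
        instruction = self.server.handle_voice(voice_data)
        if instruction == WAKE_INSTRUCTION:  # S7: wake up on instruction
            self.awake = True
        return self.awake


device = SmartDevice(Server(model=lambda voice: 93.0))
device_woke = device.on_microphone_input([0.0, 0.1, -0.1])
```

Keeping the model and the threshold on the server means the device only needs to capture audio and act on the returned instruction.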

In this embodiment of the present application, when the execution subject of the wake-up method is a server, the execution subject is not limited in this embodiment of the present application.

After the probability score of whether the awakening voice contains the preset awakening word is obtained, whether the probability score is larger than or equal to a preset probability score threshold value is judged, and the following two different situations are specifically included.

In the first case: the probability score is greater than or equal to a preset probability score threshold.

If the probability score is greater than or equal to the preset probability score threshold, it is determined to wake up.

In this embodiment of the application, a probability score threshold may be preset in the smart device or the server, and when the obtained probability score is greater than or equal to the preset probability score threshold, it is determined to wake up.

For example, the preset probability score threshold is 80 points, and if the probability score obtained at this time is 93 points, it is determined that the probability score is greater than the preset probability score threshold, and it is determined that the smart device is awakened.

In the second case: the probability score is less than a preset probability score threshold.

If the probability score is determined to be less than the preset probability score threshold, the user is prompted to re-input the wake-up voice according to a preset prompting mode.

For example, the preset probability score threshold is 80 points, and if the probability score obtained at this time is 50 points, it is determined that the probability score is less than the preset probability score threshold and the smart device cannot be woken up; the smart device may then prompt the user to re-input the wake-up voice according to the preset prompting mode.

The preset prompting mode may be preset in the smart device; for example, the user may be prompted by voice or text to re-input the wake-up voice, which is not limited in the embodiment of the present application.
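The two cases above can be sketched as a single decision function. This is only an illustrative sketch: the names `WakeDecision`, `decide_wake`, and `RETRY_PROMPT` are hypothetical, and the 0 to 100 point scale with a default threshold of 80 is taken from the examples above rather than from any fixed requirement.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prompt text; the source only says the prompt may be voice or text.
RETRY_PROMPT = "Please say the wake-up word again."


@dataclass
class WakeDecision:
    wake: bool                    # True in the first case (score >= threshold)
    prompt: Optional[str] = None  # set only in the second case (score < threshold)


def decide_wake(score: float, threshold: float = 80.0) -> WakeDecision:
    """Compare the probability score with the preset probability score threshold."""
    if score >= threshold:
        # First case: wake-up is determined.
        return WakeDecision(wake=True)
    # Second case: do not wake; ask the user to re-input the wake-up voice.
    return WakeDecision(wake=False, prompt=RETRY_PROMPT)


print(decide_wake(93))  # score from the first example
print(decide_wake(50))  # score from the second example
```

A score exactly equal to the threshold falls into the first case, matching the "greater than or equal to" wording.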

In the embodiment of the present application, a wake-up voice is acquired and used as the input parameter of a trained voice recognition model. The voice recognition model is obtained by iterative training according to a voice sample set, and the voice sample set at least comprises target wake-up word voice samples of a target user. The voice recognition model then recognizes whether the wake-up voice contains a preset wake-up word, and a probability score of whether the wake-up voice contains the preset wake-up word is obtained. If the probability score is determined to be greater than or equal to a preset probability score threshold, wake-up is determined. Because the voice recognition model is trained on a voice sample set that includes the target wake-up word voice samples of the target user, the probability score that the model outputs for the target user can be improved, so that the target user can wake up the smart device more easily and the effect of the target user waking up the smart device is greatly improved.

Based on the above embodiment, the training process of the voice recognition model is described in detail below. Referring to fig. 2, a flowchart of the training method of the voice recognition model in the embodiment of the present application is shown.

Step 200: a set of speech samples is obtained.

Wherein the voice sample set at least comprises target awakening word voice samples of target users, and the target users are VIP users.

Step 210: The voice sample set is input into a voice recognition model for training, and the probability score of whether the recognized voice sample set contains the preset wake-up word is output, until the target function of the voice recognition model converges, so as to obtain the trained voice recognition model.

The target function is the minimization of the cross entropy function of the recognition result, that is, of the probability score of whether the preset wake-up word is contained.
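As an illustration of the target function, the following minimal sketch computes a mean binary cross entropy. It assumes, beyond what the source states, that the model outputs a probability in [0, 1] (rather than points on a 0 to 100 scale) and that each sample carries a 0/1 label indicating whether it contains the preset wake-up word. The function name `binary_cross_entropy` is hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean cross entropy between predicted probabilities and 0/1 labels.

    probs[i]  : predicted probability that sample i contains the wake-up word
    labels[i] : 1 if sample i really contains the wake-up word, else 0
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Confident, correct predictions give a lower loss than hesitant ones.
print(binary_cross_entropy([0.9, 0.1], [1, 0]))
print(binary_cross_entropy([0.6, 0.4], [1, 0]))
```

Training until this quantity stops decreasing is one common reading of "until the target function converges".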

The speech sample set in the embodiment of the present application can be divided into the following two different cases.

In the first case: the voice sample set includes a general wake-up word voice sample set and a target wake-up word voice sample set.

The voice sample set comprises a general wake-up word voice sample set and a target wake-up word voice sample set; the general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word voice sample set comprises target wake-up word voice samples of a plurality of target users.

The target awakening word voice sample set is obtained by performing data simulation on each obtained target awakening word voice sample in a preset data simulation mode, wherein the data simulation mode at least comprises one or any combination of the following modes: changing speech speed, changing intonation and adding noise.

In the embodiment of the present application, when the voice recognition model is trained, the target user may input the target wake-up word voice multiple times; for example, the target user says "power on" to the smart device 20 times. The content of the target wake-up word and the number of inputs are not limited in the embodiment of the present application. In this way, the voice sample set of the voice recognition model includes at least a small number of target wake-up word voice samples of the target user.

However, the target user does not input many target wake-up word voices, possibly only dozens, so it is difficult to optimize the voice recognition model with this wake-up voice data alone. That is, if the voice recognition model is trained only on the target wake-up word voices input by the target user, the wake-up effect of the trained voice recognition model for the target user is poor.

Therefore, data simulation needs to be performed on the target wake-up word voices input by the target user. The data simulation mode at least comprises one or any combination of the following: changing speech speed, changing intonation, and adding noise. In this way, a large number of target wake-up word voice samples of the target user can be obtained, and the obtained samples form the target wake-up word voice sample set. The voice recognition model is trained with the target wake-up word voice sample set until the target function of the voice recognition model converges, yielding the trained voice recognition model. Then, with the wake-up voice as an input parameter, the trained voice recognition model recognizes whether the wake-up voice contains the preset wake-up word, and the probability score of whether the wake-up voice contains the preset wake-up word is obtained. A higher probability score can thus be obtained, so that the smart device can be woken up more easily.
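The data simulation described above can be sketched as follows, using only the Python standard library. All function names and parameter values (`change_speed`, `add_noise`, `simulate`, the speed factors, the signal-to-noise ratios) are hypothetical choices for illustration; the source does not specify how the simulation is implemented. Changing intonation is omitted here because a faithful pitch shift needs more machinery than a short sketch allows. Note also that the plain resampling used for the speed change shifts pitch as a side effect.

```python
import math
import random
from typing import List, Sequence


def change_speed(samples: Sequence[float], factor: float) -> List[float]:
    """Resample by linear interpolation; factor > 1 shortens (speeds up) the clip."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    out_len = max(2, int(round(len(samples) / factor)))
    step = (len(samples) - 1) / (out_len - 1)
    out = []
    for i in range(out_len):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)  # left neighbour index
        frac = pos - lo                       # distance from the left neighbour
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


def add_noise(samples: Sequence[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add white Gaussian noise at roughly the requested signal-to-noise ratio."""
    if not samples:
        return []
    power = sum(s * s for s in samples) / len(samples)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10.0)))
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def simulate(sample: Sequence[float], speed_factors=(0.9, 1.0, 1.1),
             snrs_db=(10.0, 20.0), seed: int = 0) -> List[List[float]]:
    """Expand one recording into several simulated variants."""
    rng = random.Random(seed)
    variants = []
    for factor in speed_factors:
        sped = change_speed(sample, factor)
        variants.append(sped)                 # speed change only
        for snr in snrs_db:
            variants.append(add_noise(sped, snr, rng))  # speed change plus noise
    return variants


# A synthetic 0.1 s, 440 Hz tone at 16 kHz stands in for one recorded wake-up word.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
variants = simulate(tone)
print(len(variants))  # 9 variants from one recording
```

With a few dozen recordings from the target user, combinations like these multiply the number of target wake-up word voice samples available for training.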

If the voice sample set includes not only the target wake-up word voice sample set but also the general wake-up word voice sample set, the voice recognition model may be obtained by first performing iterative training on the general wake-up word voice sample set and then performing iterative training again on the target wake-up word voice sample set.
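The two-stage training order (general samples first, then target samples) can be illustrated with a deliberately small stand-in model. The sketch below assumes a logistic model over hypothetical two-dimensional feature vectors and made-up toy data; a real wake-up model would use acoustic features and a larger model, neither of which the source specifies. The names `predict` and `train` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (feature vector, label); label 1 means the sample contains the wake-up word.
Example = Tuple[Sequence[float], int]


def predict(weights: List[float], features: Sequence[float]) -> float:
    """Probability that the features contain the wake-up word (logistic model)."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


def train(weights: List[float], data: List[Example],
          epochs: int = 200, lr: float = 0.5) -> List[float]:
    """Gradient descent on mean cross entropy, starting from the given weights."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for features, label in data:
            err = predict(w, features) - label  # derivative of the loss w.r.t. z
            grad[0] += err
            for j, x in enumerate(features):
                grad[j + 1] += err * x
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


# Toy data (assumed): general users cluster differently from the target user.
general_set: List[Example] = [([1.0, 1.0], 1), ([0.8, 1.2], 1),
                              ([-1.0, -1.0], 0), ([-1.2, -0.8], 0)]
target_set: List[Example] = [([1.0, -0.8], 1), ([0.9, -1.0], 1),
                             ([-1.0, -1.0], 0), ([-1.2, -0.8], 0)]

stage1 = train([0.0, 0.0, 0.0], general_set)  # iterative training on general samples
stage2 = train(stage1, target_set)            # training again on target samples

target_utterance = [1.0, -0.8]
before = predict(stage1, target_utterance)
after = predict(stage2, target_utterance)
print(f"score before target training: {before:.2f}, after: {after:.2f}")
```

Starting the second stage from the first stage's weights, rather than from scratch, is what lets a small target sample set adjust a model that already recognizes the wake-up word in general.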

Further, the target user inputs a wake-up voice each time the smart device is used. That is, as the target user uses the smart device more and more often, the target wake-up word voice samples in the voice sample set increase, so that the voice recognition model can be further optimized. For the VIP user, the wake-up effect of the smart device therefore becomes better and better as the number of uses increases.

In the second case: the voice sample set is a target wake word voice sample set.

The voice sample set comprises a target wake-up word voice sample set, and the target wake-up word sample set comprises target wake-up word voice samples of a plurality of target users.

In the embodiment of the present application, when the voice sample set is the target wake-up word voice sample set, the training mode is the same as the training mode on the target wake-up word voice sample set in the first case, and is not repeated here.

In the embodiment of the present application, the acquired voice sample set at least comprises target wake-up word voice samples of the target user, so that a voice recognition model trained for the target user can be obtained. Therefore, when the target user inputs a wake-up voice, the probability score of that wake-up voice can be improved, the target user can wake up the smart device more easily, and the wake-up effect of the smart device is greatly improved.

Based on the same inventive concept, in the embodiment of the present application, a wake-up apparatus is provided, and the wake-up apparatus may be, for example, an intelligent device in the foregoing embodiment, or may also be a server, which is not limited in this embodiment of the present application, and the wake-up apparatus may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the above embodiments, referring to fig. 3, a schematic structural diagram of a wake-up apparatus in the embodiment of the present application is shown, which specifically includes:

an obtaining module 300, configured to obtain a wake-up voice;

a processing module 310, configured to recognize whether the wake-up speech includes a preset wake-up word according to a trained speech recognition model, with the wake-up speech as an input parameter, and obtain a probability score of whether the wake-up speech includes the preset wake-up word, where the speech recognition model is obtained by iterative training according to a speech sample set, the speech sample set at least includes a target wake-up word sound sample of a target user, and the target user is a VIP user;

a determining module 320, configured to determine to wake up if it is determined that the probability score is greater than or equal to a preset probability score threshold.

Optionally, when obtaining the probability score of whether the wake-up speech includes the preset wake-up word, the determining module 320 is specifically configured to:

obtain the probability score of whether the wake-up voice contains the preset wake-up word according to the first probability score and the second probability.

Based on the same inventive concept, the embodiment of the present application provides a speech recognition model training apparatus, which may be, for example, a server or an intelligent device, and this is not limited in the embodiment of the present application. Based on the above embodiment, referring to fig. 4, a schematic structural diagram of a speech recognition model training apparatus in the embodiment of the present application specifically includes:

an obtaining module 400, configured to obtain a voice sample set, where the voice sample set at least includes a target wake-up word sound sample of a target user, and the target user is a VIP user;

a training module 410, configured to input the voice sample set into a voice recognition model for training, and output the probability score of whether the recognized voice sample set contains the preset wake-up word, until a target function of the voice recognition model converges, so as to obtain a trained voice recognition model, where the target function is the minimization of the cross entropy function of the recognition result, that is, of the probability score of whether the preset wake-up word is contained.

Based on the above embodiments, fig. 5 is a schematic structural diagram of an electronic device in an embodiment of the present application.

An embodiment of the present application provides an electronic device, which may include a processor 510 (CPU), a memory 520, an input device 530, an output device 540, and the like, wherein the input device 530 may include a keyboard, a mouse, a touch screen, and the like, and the output device 540 may include a display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), and the like.

Memory 520 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides processor 510 with program instructions and data stored in memory 520. In the embodiment of the present application, the memory 520 may be used to store a program of any one of the wake-up methods or any one of the speech recognition model training methods in the embodiment of the present application.

The processor 510 is configured to call the program instructions stored in the memory 520 and to execute, according to the obtained program instructions, any one of the wake-up methods or any one of the voice recognition model training methods in the embodiment of the present application.

Based on the foregoing embodiments, in the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the wake-up method or the speech recognition model training method in any of the above method embodiments.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method of waking up, comprising:

acquiring a wake-up voice;

2. The method of claim 1, wherein the speech recognition model is one or a combination of: a first speech recognition model and a second speech recognition model;

3. The method of claim 2, wherein obtaining a probability score of whether the wake-up speech includes a preset wake-up word comprises:

4. A method for training a speech recognition model, comprising:

5. The method of claim 4, wherein the voice sample set comprises a general wake-up word voice sample set and a target wake-up word voice sample set; the general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word voice sample set comprises target wake-up word voice samples of a plurality of target users; or,

the voice sample set is a target wake word voice sample set.

6. The method of claim 5, wherein the target wake word voice sample set is obtained by performing data simulation on each obtained target wake word voice sample in a preset data simulation manner, where the data simulation manner at least includes one or any combination of the following: changing speech speed, changing intonation and adding noise.

7. A wake-up apparatus, comprising:

an acquisition module, configured to acquire a wake-up voice;

8. The apparatus of claim 7, wherein the speech recognition model is one or a combination of: a first speech recognition model and a second speech recognition model;

9. The apparatus according to claim 8, wherein when obtaining the probability score of whether the wake-up speech includes a preset wake-up word, the determining module is specifically configured to:

10. A speech recognition model training apparatus, comprising:

11. The apparatus of claim 10, wherein the voice sample set comprises a general wake-up word voice sample set and a target wake-up word voice sample set; the general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word voice sample set comprises target wake-up word voice samples of a plurality of target users; or,

the voice sample set is a target wake word voice sample set.

12. The apparatus of claim 10, wherein the target wake word voice sample set is obtained by performing data simulation on each obtained target wake word voice sample in a preset data simulation manner, where the data simulation manner at least includes one or any combination of the following: changing speech speed, changing intonation and adding noise.

13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1-3 or 4-6 are performed when the program is executed by the processor.

14. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implements the steps of the method of any one of claims 1-3 or 4-6.