CN111312222B - Awakening and voice recognition model training method and device

Info

Publication number: CN111312222B
Application number: CN202010091382.9A
Authority: CN
Inventors: 陈天峰; 冯大航; 陈孝良; 常乐
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2020-02-13
Filing date: 2020-02-13
Publication date: 2023-09-12
Anticipated expiration: 2040-02-13
Also published as: CN111312222A

Abstract

The application relates to the technical field of computers, and in particular to a wake-up method and a voice recognition model training method and device. The wake-up method comprises: acquiring a wake-up voice; inputting the wake-up voice as an input parameter into a trained voice recognition model to recognize whether the wake-up voice contains a preset wake-up word, and obtaining a probability score of whether the wake-up voice contains the preset wake-up word, wherein the voice recognition model is obtained through iterative training on a voice sample set, the voice sample set comprises at least target wake-up word voice samples of target users, and the target users are VIP users; and determining to wake up if the probability score is greater than or equal to a preset probability score threshold. In this way, the effect of the target user waking up the intelligent device can be improved.

Description

Awakening and voice recognition model training method and device

Technical Field

The application relates to the technical field of computers, in particular to a wake-up and voice recognition model training method and device.

Background

At present, with the continuous development of artificial intelligence technology, more and more intelligent devices are appearing, such as intelligent sound boxes. When a user needs to use an intelligent device, the device must first be awakened from a dormant state. Because the wake-up performance of existing intelligent devices is tuned mainly for the majority of people, a specific person whose timbre or intonation is unusual may be unable to wake the device, or may have to input the wake-up voice several times, which seriously degrades the user experience. How to improve the effect of a target user waking up the intelligent device has therefore become a problem to be solved urgently.

Disclosure of Invention

The embodiment of the application provides a wake-up and voice recognition model training method and device, which are used for improving the wake-up effect of a target user on intelligent equipment.

The specific technical scheme provided by the embodiment of the application is as follows:

a wake-up method, comprising:

obtaining wake-up voice;

according to a trained voice recognition model, recognizing whether the wake-up voice contains a preset wake-up word or not by taking the wake-up voice as an input parameter, and obtaining a probability score of whether the wake-up voice contains the preset wake-up word or not, wherein the voice recognition model is obtained through iterative training according to a voice sample set, the voice sample set at least comprises target wake-up word voice samples of target users, and the target users are VIP users;

and if the probability score is determined to be greater than or equal to a preset probability score threshold value, determining to wake up.

Optionally, the speech recognition model is one or a combination of the following: a first speech recognition model, a second speech recognition model;

if the voice sample set comprises a general wake-up word voice sample set and a target wake-up word voice sample set, the voice recognition model is the first voice recognition model, the general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word sample set comprises target wake-up word voice samples of a plurality of target users;

and if the voice sample set is the target wake-up word voice sample set, the voice recognition model is the second voice recognition model.

Optionally, obtaining the probability score of whether the wake-up voice includes the preset wake-up word specifically includes:

according to the trained first voice recognition model, recognizing whether the wake-up voice contains a preset wake-up word or not by taking the wake-up voice as an input parameter, and obtaining a first probability score of whether the wake-up voice contains the preset wake-up word or not;

according to the trained second voice recognition model, recognizing whether the wake-up voice contains a preset wake-up word or not by taking the wake-up voice as an input parameter, and obtaining a second probability score of whether the wake-up voice contains the preset wake-up word or not;

and obtaining a probability score of whether the wake-up voice contains a preset wake-up word or not according to the first probability score and the second probability score.

A speech recognition model training method, comprising:

acquiring a voice sample set, wherein the voice sample set at least comprises target wake-up word voice samples of target users, and the target users are VIP users;

and inputting the voice sample set into a voice recognition model for training, outputting probability scores of whether the voice samples contain a preset wake-up word, and obtaining the trained voice recognition model when an objective function of the voice recognition model converges, wherein the objective function is a cross entropy function of the recognition results, namely the probability scores of whether the voice samples contain the preset wake-up word.
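The training loop described above can be sketched as follows. This is a minimal illustration only: the source does not specify the model architecture, so a linear scorer with a sigmoid output stands in for the voice recognition model, the binary form of the cross entropy is assumed, and the names `train`, `lr`, `tol` and `max_iters` are hypothetical. Convergence is taken here to mean that the loss changes by less than `tol` between iterations.

```python
import math

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cross_entropy(probs, labels, eps=1e-12):
    # Mean binary cross entropy between predicted probabilities and 0/1 labels
    # (1 = sample contains the preset wake-up word, 0 = it does not).
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def train(features, labels, lr=0.5, tol=1e-6, max_iters=10000):
    # Gradient descent on a stand-in linear model; stops when the change in
    # the cross-entropy loss falls below tol (assumed convergence criterion).
    dim = len(features[0])
    n = len(labels)
    w = [0.0] * dim
    b = 0.0
    prev = float("inf")
    loss = prev
    for _ in range(max_iters):
        probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in features]
        loss = cross_entropy(probs, labels)
        if abs(prev - loss) < tol:
            break
        prev = loss
        grad_w = [sum((p - y) * x[j] for p, y, x in zip(probs, labels, features)) / n
                  for j in range(dim)]
        grad_b = sum(p - y for p, y in zip(probs, labels)) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b, loss
```

In a real system the feature vectors would be acoustic features extracted from the voice samples and the model would typically be a neural network; the stopping rule, however, would follow the same pattern.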

Optionally, the voice sample set includes a general wake word voice sample set and a target wake word voice sample set; the general wake-up word speech sample set comprises general wake-up word speech samples of a plurality of non-target users, and the target wake-up word sample set comprises target wake-up word speech samples of a plurality of target users; or, the voice sample set is a target wake-up word voice sample set.

Optionally, the target wake-up word speech sample set is obtained by performing data simulation on each obtained target wake-up word speech sample in a preset data simulation mode, where the data simulation mode includes at least one or any combination of the following: changing the speech speed, changing the intonation, and adding noise.
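As a rough illustration of such data simulation, the sketch below derives several variants from one recorded sample. It is an assumption-laden example, not the method of the application: audio is represented as a plain list of float samples, speed change is done by naive linear-interpolation resampling (which also shifts pitch), and noise is white Gaussian noise at a chosen signal-to-noise ratio. The function names and default parameters are hypothetical. Changing intonation without changing duration requires a more involved pitch-shifting technique and is not shown.

```python
import math
import random

def change_speed(samples, factor):
    # Resample by linear interpolation; factor > 1 shortens (speeds up) the clip.
    # Naive resampling also shifts the pitch by the same factor.
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        if lo >= len(samples) - 1:
            out.append(samples[-1])
            continue
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out

def add_noise(samples, snr_db, rng):
    # Add white Gaussian noise scaled to the requested signal-to-noise ratio.
    power = sum(s * s for s in samples) / len(samples)
    noise_power = power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]

def augment(samples, speed_factors=(0.9, 1.1), snrs_db=(20.0, 10.0), seed=0):
    # Build simulated variants of one target wake-up word sample.
    rng = random.Random(seed)
    variants = [change_speed(samples, f) for f in speed_factors]
    variants += [add_noise(samples, snr, rng) for snr in snrs_db]
    return variants
```

Simulating variants in this way is one plausible route to enlarging a target sample set when only a few recordings per target user are available.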

A wake-up device, comprising:

the acquisition module is used for acquiring wake-up voice;

the processing module is used for recognizing whether the wake-up voice contains a preset wake-up word or not according to a trained voice recognition model by taking the wake-up voice as an input parameter, and obtaining a probability score of whether the wake-up voice contains the preset wake-up word or not, wherein the voice recognition model is obtained through iterative training according to a voice sample set, the voice sample set at least comprises target wake-up word voice samples of target users, and the target users are VIP users;

the determining module is used for determining awakening if the probability score is determined to be greater than or equal to a preset probability score threshold value.

Optionally, when obtaining the probability score of whether the wake-up voice includes the preset wake-up word, the determining module is specifically configured to:

A speech recognition model training apparatus comprising:

an acquisition module, used for acquiring a voice sample set, wherein the voice sample set at least comprises target wake-up word voice samples of a target user, and the target user is a VIP user;

a training module, used for inputting the voice sample set into a voice recognition model for training and outputting probability scores of whether the voice samples contain a preset wake-up word, until an objective function of the voice recognition model converges, so as to obtain the trained voice recognition model, wherein the objective function is the minimization of a cross entropy function of the recognition results, namely the probability scores of whether the voice samples contain the preset wake-up word.

An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the wake-up method or the speech recognition model training method described above when the program is executed.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the above-described speech recognition model training method.

In the embodiment of the application, a wake-up voice is acquired and input, as an input parameter, into a voice recognition model obtained through iterative training on a voice sample set, wherein the voice sample set includes at least target wake-up word voice samples of the target user. The model recognizes whether the wake-up voice contains a preset wake-up word and yields a probability score of whether the wake-up voice contains the preset wake-up word; if the probability score is determined to be greater than or equal to a preset probability score threshold, wake-up is determined. Because the voice sample set includes at least the target wake-up word voice samples of the target user, the probability score of a wake-up voice input by the target user can be raised, so that the target user can wake up the intelligent device more easily and the wake-up effect of the target user on the intelligent device is greatly improved.

Drawings

FIG. 1 is a flowchart of an intelligent device wake-up method in an embodiment of the application;

FIG. 2 is a flowchart of a method for training a speech recognition model according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a wake-up device according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a training device for a speech recognition model according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

At present, with the continuous development of artificial intelligence technology, more and more intelligent devices are appearing, such as intelligent sound boxes. When a user needs to use an intelligent device, the device must first be awakened from a sleep state. However, the wake-up performance of prior-art intelligent devices is tuned mainly for the majority of people: for example, out of 100 people, 99 can easily wake the device after inputting the wake-up voice. When the timbre or intonation of a specific person is unusual, the device may fail to wake up, or the person may have to input the wake-up voice several times, so the wake-up effect for that specific user is very poor. How to improve the wake-up effect of the target user on the intelligent device has therefore become a problem to be solved urgently.

In the embodiment of the application, a wake-up voice is acquired and taken as an input parameter of a voice recognition model obtained through iterative training on a voice sample set, wherein the voice sample set includes at least target wake-up word voice samples of the target user. The model recognizes whether the wake-up voice contains a preset wake-up word and yields a probability score of whether the wake-up voice contains the preset wake-up word; if the probability score is determined to be greater than or equal to the preset probability score threshold, wake-up is determined. Because the voice recognition model is trained on a voice sample set that includes at least the target wake-up word voice samples of the target user, the probability score obtained when the target user inputs the wake-up voice is greatly improved, and the target user can wake up the intelligent device easily.

Based on the foregoing embodiments, FIG. 1 shows a flowchart of a wake-up method in an embodiment of the present application. The execution subject of the wake-up method is not limited in the embodiment of the present application and may be, for example, an intelligent device or a server. The wake-up method specifically includes:

Step 100: Obtain the wake-up voice.

In the embodiment of the application, the wake-up voice of the user can be acquired through a microphone on the intelligent device, which is not limited in the embodiment of the application.

The microphone may be, for example, a single microphone, a dual microphone, or a plurality of microphones, and the plurality of microphones may also form different microphone arrays, for example, linear, ring, or L-shaped microphone arrays, which is not limited in the embodiment of the present application.

The wake-up speech of the user may be, for example, "power on" or "HELLO", which is not limited in the embodiment of the present application.

The intelligent device may be, for example, an intelligent sound box, an intelligent air conditioner, etc., which is not limited in the embodiment of the present application.

Step 110: According to the trained voice recognition model, recognize whether the wake-up voice contains a preset wake-up word by taking the wake-up voice as an input parameter, and obtain a probability score of whether the wake-up voice contains the preset wake-up word.

The speech recognition model is obtained through iterative training according to a speech sample set, the speech sample set at least comprises target wake-up word speech samples of target users, and the target users are VIP users.

That is, in the embodiment of the present application, the wake-up speech input by the user is input into the trained speech recognition model, that is, the wake-up speech is taken as an input parameter, whether the wake-up speech contains a preset wake-up word is recognized, and the output parameter is a probability score of whether the wake-up speech contains the preset wake-up word.

For example, if the preset wake-up word in the trained speech recognition model is "power on" and the wake-up speech input by the user is "power on", the speech is input into the trained speech recognition model to recognize whether it contains the preset wake-up word. Since the preset wake-up word is "power on", the model can recognize that the wake-up speech contains it, and the probability score of whether the wake-up speech contains the preset wake-up word is obtained as 78 points.

In the embodiment of the application, the preset wake-up word is not limited, and the probability score of whether the wake-up voice contains the preset wake-up word is not limited.

Step 120: If the probability score is determined to be greater than or equal to the preset probability score threshold, determine to wake up.

Specifically, after the wake-up voice of the user is obtained, the trained voice recognition model recognizes whether the wake-up voice contains the preset wake-up word and yields the corresponding probability score, and whether to wake up is then judged according to the obtained probability score.

In the embodiment of the present application, before executing step 120, a probability score of wake-up voice needs to be obtained first, so as to determine whether to wake-up the intelligent device according to the probability score.

The wake-up performance of prior-art devices is tuned mainly for the majority of people. That is, when 100 users wake up the smart device, the probability of successfully waking it may be as high as 99%, yet the effect may be poor for a specific user, who may, for example, need to input the wake-up voice multiple times to wake up the smart device.

For example, suppose that the intelligent device does not wake up when the specific user inputs the wake-up voice "power on" for the first time. This means that when the wake-up voice was taken as an input parameter of the trained voice recognition model and checked for the preset wake-up word "power on", the obtained probability score did not reach the preset probability score threshold. When the specific user inputs the wake-up voice "power on" for the second time, the device may again fail to wake up. Only when the specific user inputs the wake-up voice "power on" for the third time, and the probability score of the wake-up voice is greater than or equal to the preset probability score threshold, is wake-up determined.

That is, in the prior art, the wake-up effect is poor for some users because different users differ in intonation, speech speed and tone. The embodiment of the application therefore provides a wake-up method that improves the wake-up effect for target users. For example, a family usually has no more than 10 members, and what the wake-up device actually needs is the best overall effect for these specific 10 people. The wake-up method in the embodiment of the application thus gives the intelligent device a degree of personalization: it wakes up well for VIP users while the existing effect for non-target users is maintained.

Since the speech recognition model is obtained through iterative training on a speech sample set that includes at least target wake-up word speech samples of the target user, when the target user inputs the wake-up speech "power on" into the speech recognition model, the obtained probability score of whether the wake-up speech contains the preset wake-up word is higher than with the prior-art wake-up method, so that the wake-up device determines to wake up.

Further, since the speech recognition model in the embodiment of the present application is obtained after iterative training according to the speech sample set, the speech sample set at least includes the target wake-up word speech samples of the target user, and the speech recognition model can be further divided into at least two types:

First kind: the speech recognition model is a first speech recognition model.

If the voice sample set comprises a general wake word voice sample set and a target wake word voice sample set, the voice recognition model is a first voice recognition model.

The general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word sample set comprises target wake-up word voice samples of a plurality of target users.

In the embodiment of the application, the voice sample set of the first voice recognition model comprises a universal wake-up word voice sample set and a target wake-up word voice sample set.

The general wake-up word voice sample set includes general wake-up word voice samples of a plurality of non-target users and covers at least a plurality of wake-up words, for example "power on" and "how is the weather today", which is not limited in the embodiment of the present application.

In addition, each wake-up word has voice samples from a plurality of non-target users. For example, if the wake-up word is "power on", the general wake-up word voice sample set includes recordings of "power on" from a plurality of non-target users, such as a 10-year-old boy, a 7-year-old girl, a 34-year-old woman and a 56-year-old man, which is not limited in the embodiment of the present application.

The target user may be one target user or may be a plurality of target users, which is not limited in the embodiment of the present application.

The target wake-up word speech sample set includes target wake-up word speech samples of a plurality of target users and covers at least a plurality of target wake-up words, for example "power on" and "how is the weather today", which is not limited in the embodiment of the present application.

Each target wake-up word likewise has speech samples from a plurality of target users. For example, if the wake-up word is "power on", the sample set includes the "power on" utterances input by a plurality of target users.
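The composition of the two sample sets can be pictured with a small data-structure sketch. Everything here is illustrative: the `VoiceSample` record, its field names and the `build_sample_set` helper are hypothetical, and the audio payload is omitted. The sketch only encodes the rule stated above, namely that the first model trains on the general set plus the target set, while the second model trains on the target set alone.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass(frozen=True)
class VoiceSample:
    wake_word: str        # e.g. "power on"
    speaker: str          # hypothetical speaker identifier
    is_target_user: bool  # True for a target (VIP) user

def build_sample_set(general: Sequence[VoiceSample],
                     target: Sequence[VoiceSample],
                     model_kind: str) -> List[VoiceSample]:
    # The general set holds samples of non-target users only.
    if any(s.is_target_user for s in general):
        raise ValueError("general set must contain only non-target users")
    # Every training set must contain at least the target users' samples.
    if not target or not all(s.is_target_user for s in target):
        raise ValueError("target set must be non-empty and contain only target users")
    if model_kind == "first":
        return list(general) + list(target)
    if model_kind == "second":
        return list(target)
    raise ValueError(f"unknown model kind: {model_kind!r}")
```

For instance, with two general samples and one target sample for "power on", the first model would see three samples and the second model one.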

When the speech recognition model is the first speech recognition model, the executing step of obtaining the probability score of whether the wake-up speech contains the preset wake-up word specifically includes:

S1: According to the trained first voice recognition model, recognize whether the wake-up voice contains a preset wake-up word by taking the wake-up voice as an input parameter, and obtain a first probability score of whether the wake-up voice contains the preset wake-up word.

The voice sample set of the first voice recognition model comprises a general wake-up word voice sample set and a target wake-up word voice sample set, the general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word sample set comprises target wake-up word voice samples of a plurality of target users.

Specifically, according to the trained first voice recognition model, the received wake-up voice of the user is used as an input parameter of the first voice recognition model, whether the wake-up voice contains a preset wake-up word or not is recognized through the trained first voice recognition model, and a first probability score of whether the wake-up voice contains the preset wake-up word or not is obtained.

For example, assume that the preset wake-up word is "small a quick start" and the wake-up voice input by the target user into the intelligent device is "small a quick start". The intelligent device takes the received wake-up voice as an input parameter of the first voice recognition model, recognizes the wake-up word "small a quick start" through the trained first voice recognition model, and obtains a first probability score of 89 points for whether the wake-up voice contains the preset wake-up word, which is not limited in the embodiment of the application.

In addition, in the embodiment of the application, when the voice recognition model is the first voice recognition model, the voice sample set comprises the universal wake word voice sample set and the target wake word voice sample set, so that the wake effect on the target user can be improved, and the wake effect on the non-target user can be maintained.

S2: According to the trained second voice recognition model, recognize whether the wake-up voice contains a preset wake-up word by taking the wake-up voice as an input parameter, and obtain a second probability score of whether the wake-up voice contains the preset wake-up word.

The voice sample set of the second voice recognition model at least comprises a target wake-up word voice sample set, and the target wake-up word sample set comprises target wake-up word voice samples of a plurality of target users.

Specifically, according to the trained second voice recognition model, the received wake-up voice of the user is taken as the input parameter of the second voice recognition model, whether the wake-up voice contains a preset wake-up word or not is recognized through the trained second voice recognition model, and a second probability score of whether the wake-up voice contains the preset wake-up word or not is obtained.

For example, assume that the preset wake-up word is "small a quick start" and the wake-up voice input by the target user into the intelligent device is "small a quick start". The intelligent device takes the received wake-up voice as an input parameter of the second voice recognition model, recognizes the wake-up word "small a quick start" through the trained second voice recognition model, and obtains a second probability score of 93 points for whether the wake-up voice contains the preset wake-up word, which is not limited in the embodiment of the application.

S3: Obtain the probability score of whether the wake-up voice contains the preset wake-up word according to the first probability score and the second probability score.

Specifically, step S3 may be executed in either of the following two ways. These are only examples; in the embodiment of the present application, other ways may also be adopted to obtain the probability score of whether the wake-up speech includes the preset wake-up word, which is not limited in the embodiment of the present application.

The first way is: obtaining the probability score of whether the wake-up voice contains the preset wake-up word according to the sum of the first probability score and the second probability score.

The step S3 is executed, specifically including:

adding the first probability score and the second probability score to obtain the probability score of whether the wake-up voice contains the preset wake-up word.

For example, assuming that the first probability score is 89 points and the second probability score is 93 points, the probability score of whether the wake-up speech includes the preset wake-up word is 182 points, which is not limited in the embodiment of the present application.

The second way is: a probability score is obtained from the weights.

The step S3 is executed, specifically including:

obtaining the probability score of whether the wake-up voice contains the preset wake-up word according to the first probability score, the second probability score and the weights of the first probability score and the second probability score.

For example, assuming that the first probability score is 89 points, the second probability score is 93 points, the weight of the first probability score is 0.4, and the weight of the second probability score is 0.6, the probability score of whether the wake-up speech includes the preset wake-up word is 89×0.4+93×0.6=91.4 points, which is not limited in the embodiment of the present application.
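The two ways of combining the scores in step S3 amount to simple arithmetic, sketched below with the figures from the examples. The function names are hypothetical, and the default weights 0.4 and 0.6 are just the example values given above rather than recommended settings.

```python
def fuse_by_sum(first_score: float, second_score: float) -> float:
    # First way: add the two probability scores.
    return first_score + second_score

def fuse_by_weights(first_score: float, second_score: float,
                    first_weight: float = 0.4,
                    second_weight: float = 0.6) -> float:
    # Second way: weighted combination of the two probability scores.
    return first_score * first_weight + second_score * second_weight

summed = fuse_by_sum(89, 93)        # 182
weighted = fuse_by_weights(89, 93)  # approximately 91.4
```

Note that the summed score lies on a different scale from either individual score, so a preset threshold used with the first way would presumably have to be chosen for that scale, whereas weights that add up to 1 keep the fused score on the original scale.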

Second kind: the speech recognition model is a second speech recognition model.

If the voice sample set is the target wake word voice sample set, the voice recognition model is a second voice recognition model.

The voice sample set of the second speech recognition model is the target wake-up word speech sample set, which comprises target wake-up word speech samples of a plurality of target users.

In the embodiment of the application, when the voice recognition model is the second voice recognition model, the probability score of whether the wake-up voice contains the preset wake-up word can be directly obtained.

For example, assume that the preset wake-up word is "how is the weather today" and the wake-up voice input by the target user into the smart device is "how is the weather today". The smart device takes the received wake-up voice as an input parameter of the second speech recognition model, recognizes the wake-up word "how is the weather today" through the trained second speech recognition model, and obtains a probability score of 87 points for whether the wake-up voice contains the preset wake-up word, which is not limited in the embodiment of the present application.

Further, if the execution main body of the wake-up method in the embodiment of the present application is a server, the wake-up method in the embodiment of the present application specifically includes:

S1: The intelligent device acquires voice data input by a user through a microphone.

S2: The intelligent device sends the voice data to the server.

S3: The server recognizes whether the wake-up voice contains a preset wake-up word according to the trained voice recognition model by taking the wake-up voice as an input parameter, and obtains a probability score of whether the wake-up voice contains the preset wake-up word.

S4: The server judges whether the probability score is greater than or equal to a preset probability score threshold.

S5: If the server determines that the probability score is greater than or equal to the preset probability score threshold, the server generates a wake-up instruction.

S6: The server sends the wake-up instruction to the intelligent device.

S7: The intelligent device wakes up according to the received wake-up instruction.
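The exchange in steps S1 to S7 can be sketched as two cooperating objects. This is a simplified illustration under several assumptions: a direct method call stands in for the network transport between device and server, `score_fn` is a hypothetical placeholder for the trained voice recognition model, and the class and method names are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class WakeInstruction:
    score: float  # probability score that triggered the wake-up

class WakeServer:
    def __init__(self, score_fn: Callable[[Sequence[float]], float], threshold: float):
        self.score_fn = score_fn    # stand-in for the trained model
        self.threshold = threshold  # preset probability score threshold

    def handle_voice(self, audio: Sequence[float]) -> Optional[WakeInstruction]:
        score = self.score_fn(audio)       # S3: score the wake-up voice
        if score >= self.threshold:        # S4: compare with the threshold
            return WakeInstruction(score)  # S5/S6: generate and return instruction
        return None

class SmartDevice:
    def __init__(self, server: WakeServer):
        self.server = server
        self.awake = False

    def on_voice(self, audio: Sequence[float]) -> bool:
        # S1: audio captured by the microphone arrives here.
        instruction = self.server.handle_voice(audio)  # S2: send to the server
        if instruction is not None:
            self.awake = True                          # S7: wake up
        return self.awake
```

Keeping the threshold comparison on the server means the device only has to react to an instruction, which is one plausible reason for placing the model there when the device has limited computing resources.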

In the embodiment of the present application, the case where the execution subject of the wake-up method is a server is only an example, and the execution subject is not limited in the embodiment of the present application.

After the probability score of whether the wake-up voice contains the preset wake-up word is obtained, whether the probability score is greater than or equal to the preset probability score threshold is judged, which specifically includes the following two different cases.

First case: the probability score is greater than or equal to a preset probability score threshold.

If the probability score is determined to be greater than or equal to the preset probability score threshold, wake-up is determined.

In the embodiment of the application, a probability score threshold can be preset in the intelligent device or the server, and when the obtained probability score is greater than or equal to the preset probability score threshold, wake-up is determined.

For example, if the preset probability score threshold is 80 points and the probability score obtained this time is 93 points, the probability score is determined to be greater than the preset probability score threshold, and it is determined that the intelligent device is awakened. The value of the preset probability score threshold is not limited in the embodiment of the present application.

Second case: the probability score is less than a preset probability score threshold.

If the probability score is smaller than the preset probability score threshold, the user is prompted to input the wake-up voice again according to a preset prompting mode.

For example, if the preset probability score threshold is 80 points and the probability score obtained this time is 50 points, the probability score is determined to be smaller than the preset probability score threshold, and the intelligent device cannot be awakened. The intelligent device can then prompt the user to input the wake-up voice again according to the preset prompting mode.

The preset prompting mode may be preset in the intelligent device; for example, the user may be prompted by voice or text to input the voice again, which is not limited in the embodiment of the present application.
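The two cases above can be illustrated with a short sketch. The function name `decide` and the returned strings are hypothetical; the threshold of 80 and the scores 93 and 50 are the example values used in the text.

```python
def decide(score: float, threshold: float = 80.0) -> str:
    """Return "wake" when score >= threshold, otherwise "prompt_retry"."""
    if score >= threshold:
        return "wake"          # first case: wake-up is determined
    return "prompt_retry"      # second case: prompt the user to input again


print(decide(93.0))  # wake
print(decide(50.0))  # prompt_retry
print(decide(80.0))  # wake (a score equal to the threshold also wakes)
```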

In the embodiment of the application, the wake-up voice is obtained and used as the input parameter of the trained voice recognition model, where the voice recognition model is obtained through iterative training according to the voice sample set, and the voice sample set at least comprises target wake-up word voice samples of the target user. Whether the wake-up voice contains the preset wake-up word is then recognized through the voice recognition model, and the probability score of whether the wake-up voice contains the preset wake-up word is obtained. If the probability score is determined to be greater than or equal to the preset probability score threshold, wake-up is determined. Because the voice recognition model is trained on a voice sample set that contains the target wake-up word voice samples of the target user, the probability score that the model outputs for the target user can be improved, so that the target user can wake up the intelligent device more easily, and the effect of the target user waking up the intelligent device is greatly improved.

Based on the foregoing embodiments, the following describes the training process of the speech recognition model in detail, and referring to fig. 2, a flow of a training method of the speech recognition model in an embodiment of the present application is shown.

Step 200: a set of speech samples is obtained.

The voice sample set at least comprises target wake-up word voice samples of target users, and the target users are VIP users.

Step 210: the voice sample set is input into a voice recognition model for training, and a probability score of whether the voice sample set contains a preset wake-up word is output, until an objective function of the voice recognition model converges, so as to obtain the trained voice recognition model.

The objective function is to minimize the cross entropy function of the recognition result of the probability score of whether the preset wake-up word is contained.
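As an illustration of the objective, the following sketch computes a binary cross entropy over a batch of recognition results. It assumes that the probability score is expressed as a probability in [0, 1] and that the label is 1 when the sample contains the preset wake-up word and 0 otherwise; the embodiment does not fix the score scale or the exact form of the loss, so the function name and these conventions are assumptions.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Confident correct predictions give a lower loss than confident wrong ones.
good = binary_cross_entropy([0.9, 0.1], [1, 0])
bad = binary_cross_entropy([0.1, 0.9], [1, 0])
```

Training then adjusts the model parameters so that this value decreases until it stops changing appreciably, which is one common reading of "the objective function converges".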

The speech sample set in the embodiment of the present application can be divided into the following two different cases.

First case: the speech sample set includes a generic wake word speech sample set and a target wake word speech sample set.

The speech sample set comprises a general wake word speech sample set and a target wake word speech sample set, wherein the general wake word speech sample set comprises general wake word speech samples of a plurality of non-target users, and the target wake word sample set comprises target wake word speech samples of a plurality of target users.

The target wake-up word voice sample set is obtained by performing data simulation on each obtained target wake-up word voice sample in a preset data simulation mode, and the data simulation mode includes at least one or any combination of the following: changing the speech speed, changing the intonation, and adding noise.

In the embodiment of the application, when the voice recognition model is trained, the target user can input the target wake-up word voice a plurality of times; for example, the target user speaks the target wake-up word "power on" to the intelligent device 20 times. The content of the target wake-up word and the number of inputs are not limited in the embodiment of the application. Thus, the voice sample set of the voice recognition model includes at least a small number of target wake-up word voice samples of the target user.

However, because the target user does not input much target wake-up word voice, there may be only tens of pieces of wake-up word voice data, and it is difficult to optimize the voice recognition model with these data alone. That is, if the voice recognition model is trained only on the target wake-up word voice input by the target user, the wake-up effect of the trained model for the target user is not good enough.

Therefore, data simulation needs to be performed on the target wake-up word voice input by the target user, where the data simulation mode includes at least one or any combination of the following: changing the speech speed, changing the intonation, and adding noise. In this way, a large number of target wake-up word voice samples of the target user can be obtained and assembled into a target wake-up word voice sample set. The voice recognition model is then trained on the target wake-up word voice sample set until the objective function of the voice recognition model converges, yielding the trained voice recognition model. When the trained voice recognition model is then used to recognize whether the wake-up voice contains the preset wake-up word, a higher probability score can be obtained for the target user, so that the intelligent device can be awakened well.
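A minimal sketch of such data simulation is given below, using only the Python standard library and treating a recording as a list of float samples. The function names, the speed factors and the signal-to-noise ratios are assumptions chosen for illustration. Speed change is done by plain linear-interpolation resampling, which also shifts pitch; a faithful intonation change needs a time-stretching algorithm and is deliberately left out of this sketch.

```python
import random
from typing import List, Sequence


def change_speed(samples: Sequence[float], factor: float) -> List[float]:
    """Resample by linear interpolation; factor > 1 shortens (speeds up) the audio."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def add_noise(samples: Sequence[float], snr_db: float,
              rng: random.Random) -> List[float]:
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    if not samples:
        return []
    power = sum(s * s for s in samples) / len(samples)
    sigma = (power / (10 ** (snr_db / 10))) ** 0.5
    return [s + rng.gauss(0.0, sigma) for s in samples]


def simulate(sample: Sequence[float],
             speed_factors: Sequence[float] = (0.9, 1.0, 1.1),
             snrs_db: Sequence[float] = (10.0, 20.0),
             seed: int = 0) -> List[List[float]]:
    """Expand one recording into several simulated variants."""
    rng = random.Random(seed)
    variants = []
    for factor in speed_factors:
        sped = change_speed(sample, factor)
        variants.append(sped)                            # speed change only
        for snr in snrs_db:
            variants.append(add_noise(sped, snr, rng))   # speed change plus noise
    return variants
```

With the default settings each recording yields nine variants (including an unmodified copy at speed factor 1.0), so a few dozen recordings from the target user expand into a few hundred training samples.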

If the voice sample set includes not only the target wake-up word voice sample set but also the general wake-up word voice sample set, the voice recognition model may be obtained by first performing iterative training on the general wake-up word voice sample set and then performing iterative training again on the target wake-up word voice sample set.
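The two-stage training order can be sketched with a toy model. The embodiment does not specify the model architecture, so a small logistic regression over made-up two-dimensional feature vectors is used here purely as a stand-in; the function names, learning rate, tolerance and the `GENERAL` and `TARGET` data are all assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

# (feature vector, label); label 1 means the sample contains the wake-up word.
Example = Tuple[Sequence[float], int]


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights: Sequence[float], data: Sequence[Example]) -> float:
    """Mean binary cross entropy of the model on the data."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train(weights: Sequence[float], data: Sequence[Example], lr: float = 0.1,
          tol: float = 1e-6, max_epochs: int = 1000) -> List[float]:
    """Full-batch gradient descent until the loss change falls below tol."""
    w = list(weights)
    prev = mean_loss(w, data)
    for _ in range(max_epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        w = [wj - lr * g / len(data) for wj, g in zip(w, grad)]
        cur = mean_loss(w, data)
        if abs(prev - cur) < tol:  # treated here as "objective has converged"
            break
        prev = cur
    return w


# Toy data: general users versus a target user whose samples look different.
GENERAL = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
TARGET = [([0.4, 0.6], 1), ([0.5, 0.5], 1), ([0.0, 1.0], 0)]

stage1 = train([0.0, 0.0, 0.0], GENERAL)  # train on the general set first
stage2 = train(stage1, TARGET)            # then train again on the target set
```

Starting the second stage from the weights of the first stage, rather than from scratch, lets the small target set adjust a model that already recognizes the wake-up word for most speakers.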

Further, since wake-up voice is input whenever the target user uses the intelligent device, the number of target wake-up word voice samples in the voice sample set increases as the number of times the user uses the intelligent device increases, so that the voice recognition model can be continuously optimized. For the VIP user, the wake-up effect of the intelligent device therefore improves with the number of uses.

Second case: the speech sample set is a target wake-up word speech sample set.

The speech sample set includes a target wake word speech sample set including target wake word speech samples of a plurality of target users.

In the embodiment of the present application, when the speech sample set is the target wake-up word speech sample set, the training mode is the same as that of the speech sample set in the first case, and will not be repeated here.

In the embodiment of the application, the acquired voice sample set at least comprises the target wake-up word voice samples of the target user, so that a voice recognition model trained for the target user can be obtained from the voice sample set. Therefore, when the target user inputs wake-up voice, the probability score of that wake-up voice can be improved, the target user can wake up the intelligent device more easily, and the wake-up effect of the intelligent device is greatly improved.

Based on the same inventive concept, a wake-up device is provided in the embodiment of the present application, and the wake-up device may be, for example, an intelligent device in the foregoing embodiment, or may be a server, which is not limited in the embodiment of the present application, and the wake-up device may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the foregoing embodiments, referring to fig. 3, a schematic structural diagram of a wake-up device according to an embodiment of the present application is shown, which specifically includes:

an acquisition module 300, configured to acquire wake-up speech;

the processing module 310 is configured to recognize, according to a trained speech recognition model and by taking the wake-up speech as an input parameter, whether the wake-up speech includes a preset wake-up word, and obtain a probability score of whether the wake-up speech includes the preset wake-up word, where the speech recognition model is obtained by iterative training according to a speech sample set, the speech sample set includes at least a target wake-up word speech sample of a target user, and the target user is a VIP user;

a determining module 320, configured to determine to wake up if it is determined that the probability score is greater than or equal to a preset probability score threshold.

Optionally, when obtaining the probability score of whether the wake-up speech includes the preset wake-up word, the determining module 320 is specifically configured to:

obtain a probability score of whether the wake-up voice contains a preset wake-up word according to the first probability score and the second probability score.

Based on the same inventive concept, the embodiment of the present application provides a speech recognition model training device, which may be, for example, a server or an intelligent device, and the embodiment of the present application is not limited to this. The speech recognition model training device may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the above embodiments, referring to fig. 4, a schematic structural diagram of a speech recognition model training device according to an embodiment of the present application is shown, which specifically includes:

an obtaining module 400, configured to obtain a voice sample set, where the voice sample set includes at least a target wake-up word voice sample of a target user, and the target user is a VIP user;

the training module 410 is configured to input the speech sample set to a speech recognition model for training, and output a probability score of whether the speech sample set contains a preset wake-up word, until an objective function of the speech recognition model converges, so as to obtain a trained speech recognition model, where the objective function is to minimize the cross entropy function of the recognition result of the probability score of whether the preset wake-up word is contained.

Based on the above embodiments, referring to fig. 5, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown.

Embodiments of the present application provide an electronic device that may include a processor 510 (Central Processing Unit, CPU), a memory 520, an input device 530, an output device 540, etc., where the input device 530 may include a keyboard, a mouse, a touch screen, etc., and the output device 540 may include a display device, such as a liquid crystal display (Liquid Crystal Display, LCD), a Cathode Ray Tube (CRT), etc.

Memory 520 may include Read Only Memory (ROM) and Random Access Memory (RAM) and provides processor 510 with program instructions and data stored in memory 520. In an embodiment of the present application, the memory 520 may be used to store a program of any one of the wake-up methods or any one of the speech recognition model training methods in the embodiment of the present application.

The processor 510 is configured to execute any one of the wake-up methods or any one of the speech recognition model training methods according to the embodiment of the present application by calling the program instructions stored in the memory 520.

Based on the above embodiments, in the embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the wake-up method or the speech recognition model training method in any of the above method embodiments.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A method of waking up comprising:

obtaining wake-up voice;

if the probability score is determined to be greater than or equal to a preset probability score threshold, determining to wake up;

the obtaining a probability score of whether the wake-up voice contains a preset wake-up word specifically comprises the following steps:

according to the trained first voice recognition model, recognizing whether the wake-up voice contains a preset wake-up word or not by taking the wake-up voice as an input parameter, and obtaining a first probability score of whether the wake-up voice contains the preset wake-up word or not; according to the trained second voice recognition model, recognizing whether the wake-up voice contains a preset wake-up word or not by taking the wake-up voice as an input parameter, and obtaining a second probability score of whether the wake-up voice contains the preset wake-up word or not; the voice sample set of the first voice recognition model comprises a general wake-up word voice sample set and a target wake-up word voice sample set, wherein the general wake-up word voice sample set comprises general wake-up word voice samples of a plurality of non-target users, and the target wake-up word sample set comprises target wake-up word voice samples of a plurality of target users; the voice sample set of the second voice recognition model comprises a target wake-up word voice sample set;

2. The method of claim 1, wherein the speech recognition model is one or a combination of: a first speech recognition model, a second speech recognition model;

3. A speech recognition model training method based on the wake-up method of any of claims 1-2, comprising:

4. The method of claim 3, wherein the set of speech samples comprises a generic wake word speech sample set and a target wake word speech sample set; the general wake-up word speech sample set comprises general wake-up word speech samples of a plurality of non-target users, and the target wake-up word sample set comprises target wake-up word speech samples of a plurality of target users; or,

the speech sample set is a target wake-up word speech sample set.

5. The method of claim 4, wherein the target wake-up word speech sample set is obtained by performing data simulation on each obtained target voice wake-up word speech sample by a preset data simulation mode, and the data simulation mode at least comprises one or any combination of the following: the speech speed is changed, the intonation is changed, and noise is added.

6. A wake-up device, comprising:

the acquisition module is used for acquiring wake-up voice;

the determining module is used for determining awakening if the probability score is determined to be greater than or equal to a preset probability score threshold value;

when obtaining a probability score of whether the wake-up voice contains a preset wake-up word, the determining module is specifically configured to:

7. The apparatus of claim 6, wherein the speech recognition model is one or a combination of: a first speech recognition model, a second speech recognition model;

8. A speech recognition model training device based on the wake-up method of any of claims 1-2, comprising:

9. The apparatus of claim 8, wherein the set of speech samples comprises a generic wake word speech sample set and a target wake word speech sample set; the general wake-up word speech sample set comprises general wake-up word speech samples of a plurality of non-target users, and the target wake-up word sample set comprises target wake-up word speech samples of a plurality of target users; or,

the speech sample set is a target wake-up word speech sample set.

10. The apparatus of claim 8, wherein the target wake-up word speech sample set is obtained by performing data simulation on each obtained target voice wake-up word speech sample by a preset data simulation mode, the data simulation mode at least comprising one or any combination of the following: the speech speed is changed, the intonation is changed, and noise is added.

11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1-2 or 3-5 when the program is executed.

12. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program implementing the steps of the method of any of claims 1-2 or 3-5 when executed by a processor.