CN114360531A - Speech recognition method, control method, model training method and device thereof


Info

Publication number
CN114360531A
Authority
CN
China
Prior art keywords
time
voice
vad
information
voice interaction
Prior art date
Legal status
Pending
Application number
CN202111513576.4A
Other languages
Chinese (zh)
Inventor
赵鹏
沙砼
郭亚文
Current Assignee
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Shanghai Xiaodu Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiaodu Technology Co Ltd filed Critical Shanghai Xiaodu Technology Co Ltd
Priority to CN202111513576.4A
Publication of CN114360531A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building

Landscapes

  • Engineering & Computer Science
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Computational Linguistics
  • Computer Vision & Pattern Recognition
  • Telephonic Communication Services

Abstract

The application discloses a speech recognition method, a control method, a model training method, and devices thereof, relating to the fields of deep learning, artificial intelligence, and speech technology. A specific implementation scheme is as follows: based on identification information of a voice interaction device operator, obtaining the voice activity detection (VAD) cutoff time and dialect type corresponding to that operator; collecting first voice information of the operator based on the VAD cutoff time; and recognizing the first voice information based on a dialect speech recognition model corresponding to the dialect type. When a voice interaction device is shared by multiple people, the method and device effectively handle specific operators who speak slowly or with a heavy dialect accent during voice interaction, improving speech recognition accuracy.

Description

Speech recognition method, control method, model training method and device thereof
Technical Field
The application relates to the field of computer technology, in particular to the fields of deep learning, artificial intelligence, and speech technology, and more particularly to a speech recognition method, a control method for a voice interaction device, a truncation time optimization model training method, and devices thereof.
Background
As society ages, the proportion of elderly people rises year by year. Owing to declining memory, eyesight, and learning ability, elderly users often struggle with electronic devices, and voice interaction can help them bridge this usability gap. However, when elderly users interact by voice, speech recognition in the related art often suffers from inaccurate recognition and similar problems.
Disclosure of Invention
The application provides a speech recognition method, a control method for a voice interaction device, a truncation time optimization model training method, and devices thereof, applicable to voice interaction scenarios.
According to a first aspect of the present application, there is provided a speech recognition method comprising:
obtaining a voice activity detection (VAD) cutoff time and a dialect type corresponding to a voice interaction device operator based on identification information of the operator;
collecting first voice information of the operator based on the VAD cutoff time; and
recognizing the first voice information based on a dialect speech recognition model corresponding to the dialect type.
According to a second aspect of the present application, there is provided a control method of a voice interaction device, including:
obtaining collected voice information of a voice interaction device operator;
performing voiceprint feature recognition on the voice information to obtain voiceprint feature information;
determining, based on the voiceprint feature information, that the operator is a specific operator, and obtaining the voice activity detection (VAD) cutoff time and dialect type corresponding to the specific operator according to the voiceprint feature information; and
controlling the voice interaction device to perform voice interaction with the specific operator based on the VAD cutoff time and the dialect type.
According to a third aspect of the present application, there is provided a truncation time optimization model training method, wherein the truncation time optimization model is used for predicting the VAD truncation time length in voice interaction scenarios, the method comprising:
obtaining input voice information of a sample user during voice interaction with a voice interaction device;
generating a training sample from the input voice information, the training sample comprising a first interval time between the wake-up word and the instruction word, second interval times between the word segments of the instruction word, and the pickup time of the input voice information;
inputting the first interval time and the second interval times into an initial model to obtain a predicted VAD truncation time; and
training the initial model based on the pickup time and the predicted VAD truncation time to obtain model parameters, and generating the VAD truncation time optimization model based on the model parameters.
According to a fourth aspect of the present application, there is provided a speech recognition apparatus comprising:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring the voice activation detection VAD cut-off time and dialect type corresponding to a voice interaction device operator based on the identification information of the voice interaction device operator;
the acquisition module is used for acquiring first voice information of the operator of the voice interaction equipment based on the VAD cut-off time;
and the recognition module is used for recognizing the first voice information based on a dialect voice recognition model corresponding to the dialect type.
According to a fifth aspect of the present application, there is provided a control apparatus of a voice interaction device, comprising:
a first obtaining module configured to obtain collected voice information of a voice interaction device operator;
a recognition module configured to perform voiceprint feature recognition on the voice information to obtain voiceprint feature information;
a second obtaining module configured to determine, based on the voiceprint feature information, that the operator is a specific operator, and to obtain the voice activity detection (VAD) cutoff time and dialect type corresponding to the specific operator according to the voiceprint feature information; and
a control module configured to control the voice interaction device to perform voice interaction with the specific operator based on the VAD cutoff time and the dialect type.
According to a sixth aspect of the present application, there is provided a truncation time optimization model training apparatus, wherein the truncation time optimization model is used for predicting the VAD truncation time length in voice interaction scenarios, the apparatus comprising:
an obtaining module configured to obtain input voice information of a sample user during voice interaction with a voice interaction device;
a generating module configured to generate a training sample from the input voice information, the training sample comprising a first interval time between the wake-up word and the instruction word, second interval times between the word segments of the instruction word, and the pickup time of the input voice information;
a prediction module configured to input the first interval time and the second interval times into an initial model to obtain a predicted VAD truncation time; and
a training module configured to train the initial model based on the pickup time and the predicted VAD truncation time, obtain model parameters, and generate the VAD truncation time optimization model based on the model parameters.
According to a seventh aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to an eighth aspect of the present application, there is provided an electronic apparatus comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the second aspect.
According to a ninth aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the third aspect.
According to a tenth aspect of the present application, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect.
According to an eleventh aspect of the present application, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the second aspect.
According to a twelfth aspect of the present application, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the third aspect.
According to a thirteenth aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
According to a fourteenth aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of the second aspect.
According to a fifteenth aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of the third aspect.
According to the technical solution of the application, when a voice interaction device is shared by multiple people, the problems that a specific operator speaks slowly and with a heavy dialect accent during voice interaction can be effectively addressed, improving speech recognition accuracy.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a flowchart of another speech recognition method according to an embodiment of the present application;
fig. 3 is a flowchart of another speech recognition method according to an embodiment of the present application;
fig. 4 is a flowchart of another speech recognition method according to an embodiment of the present application;
fig. 5 is a flowchart of a control method for a voice interaction device according to an embodiment of the present application;
fig. 6 is a flowchart of another control method for a voice interaction device according to an embodiment of the present application;
fig. 7 is a flowchart of a VAD truncation time optimization model training method according to an embodiment of the present application;
fig. 8 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 9 is a block diagram of another speech recognition apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of a control apparatus for a voice interaction device according to an embodiment of the present application;
fig. 11 is a block diagram of another control apparatus for a voice interaction device according to an embodiment of the present application;
fig. 12 is a block diagram of a truncation time optimization model training apparatus according to an embodiment of the present application;
fig. 13 is a block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Those of ordinary skill in the art will accordingly recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
It should be noted that, in the technical solution of the present application, the acquisition, storage, and application of users' personal information all comply with the relevant laws and regulations and do not violate public order or good customs.
In the embodiments of the present application, the term "VAD (Voice Activity Detection)", also called voice activity detection or silence suppression, refers to detecting whether the current audio signal contains speech, that is, judging the input signal, distinguishing speech from the various background noise signals, and applying different processing to each.
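For illustration, a minimal energy-threshold VAD sketch follows; the patent does not specify the VAD algorithm, and the frame length and threshold here are arbitrary assumptions.

```python
import numpy as np

def energy_vad(samples: np.ndarray, sample_rate: int = 16000,
               frame_ms: int = 30, threshold: float = 0.02) -> list:
    """Return one speech/non-speech flag per frame based on RMS energy."""
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2)))
        flags.append(rms > threshold)  # True = speech, False = silence/noise
    return flags
```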
Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first", "second", or "third" may explicitly or implicitly include at least one such feature.
As society ages, the proportion of elderly people rises year by year. Owing to declining memory, eyesight, and learning ability, elderly users often struggle with electronic devices, and voice interaction can help them bridge this usability gap. However, when elderly users interact by voice, speech recognition in the related art often suffers from inaccurate recognition and similar problems.
To address the above problems, the present application provides a speech recognition method, a control method for a voice interaction device, a truncation time optimization model training method, and devices thereof, which are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application. It should be noted that the speech recognition method according to the embodiment of the present application can be applied to the speech recognition apparatus according to the embodiment of the present application, and the speech recognition apparatus can be configured in an electronic device. As an example, the electronic device may be a voice interaction device. As shown in fig. 1, the speech recognition method may include, but is not limited to, the following steps.
In step 101, a voice activity detection (VAD) cutoff time and a dialect type corresponding to a voice interaction device operator are obtained based on identification information of the operator.
In the embodiments of the present application, the voice interaction device can be understood as an electronic device with a voice interaction function, such as a smart speaker or a smart screen with that function. The voice interaction device operator can be understood as a user interacting with the device. The present application is applicable both to scenarios where a voice interaction device is shared by multiple people and to scenarios where it is dedicated to one person.
The identification information identifies which operator is using the voice interaction device. In the embodiments of the present application, it may be voiceprint feature information, face feature information, user name information, or the like. As an example, when the identification information is user name information, the voice interaction device may provide a login interface; the operator logs in with the user name information to confirm authorization to use the device, and the device obtains the user name information from the login interface to determine which operator is using it.
For example, when an operator uses the voice interaction device, the device may determine the operator's identification information and, based on it, obtain the VAD cutoff time and dialect type corresponding to that operator. That is, the identification information of different operators can be used to determine each operator's VAD cutoff time and dialect type, so that the device can be controlled to perform voice interaction with the operator accordingly.
In the embodiments of the present application, the VAD cutoff time can be understood as the sound pickup window, that is, how long the device keeps listening without detecting speech before stopping. For example, once the trailing silence exceeds the VAD cutoff time, the voice interaction device stops collecting the voice, as in the sketch below.
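A minimal sketch of how a per-operator cutoff time might gate sound pickup, reusing the frame flags from the VAD sketch above; all names are hypothetical, not from the patent.

```python
def pick_up(frames, vad_flags, cutoff_ms: int, frame_ms: int = 30):
    """Collect frames until `cutoff_ms` of continuous silence is observed."""
    collected, silence_ms = [], 0
    for frame, is_speech in zip(frames, vad_flags):
        collected.append(frame)
        silence_ms = 0 if is_speech else silence_ms + frame_ms
        if silence_ms >= cutoff_ms:  # trailing silence reached the cutoff
            break
    return collected
```

A slow speaker thus simply needs a larger cutoff_ms, so that natural pauses are not mistaken for the end of the utterance.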
In step 102, first voice information of an operator of a voice interaction device is collected based on a VAD cutoff time.
For example, a voice collection device (such as a microphone) is generally disposed on the voice interaction device. Once the VAD cutoff time corresponding to the operator is determined, the device can be controlled to collect the operator's voice based on that cutoff time, thereby collecting the first voice information.
It is understood that the first voice information is the voice information collected after the operator's identity has been determined and the device has switched to the VAD cutoff time corresponding to that operator. That is, the first voice information may be wake-up word voice information, instruction voice information without a wake-up word, or instruction voice information containing a wake-up word.
In step 103, the first speech information is recognized based on the dialect speech recognition model corresponding to the dialect type.
In an embodiment of the present application, when the first voice information of the voice interaction device operator is collected based on the VAD cutoff time, it may be recognized based on the dialect speech recognition model corresponding to the dialect type used by that operator.
For example, suppose dialect speech recognition models for multiple dialect types are deployed offline on the voice interaction device. Once the operator's dialect type is determined, the model corresponding to that type can be selected from among them and used to recognize the voice information input by the operator.
In another embodiment of the application, suppose the dialect speech recognition models for multiple dialect types are deployed on a cloud server. After the operator's dialect type is determined, the voice interaction device can download and store the corresponding model from the cloud server, so that it can subsequently recognize the voice information input by that operator.
In yet another embodiment, suppose the models are deployed on a cloud server; once the operator's dialect type is determined and the first voice information has been collected based on the operator's VAD cutoff time, the dialect type and the first voice information can be sent to the cloud server. The server selects the corresponding dialect speech recognition model based on the dialect type and uses it to recognize the first voice information, that is, speech recognition is performed on the server to obtain a recognition result for subsequent processing, for example generating the corresponding interactive response.
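The model-selection step common to these deployments can be sketched as a simple registry lookup; the dialect names and the model class are illustrative placeholders, not part of the patent.

```python
class DialectASRModel:
    """Placeholder for a dialect-specific speech recognition model."""
    def __init__(self, dialect: str):
        self.dialect = dialect

    def recognize(self, audio) -> str:
        # Actual acoustic decoding is out of scope for this sketch.
        return f"<transcript of {len(audio)} frames in {self.dialect}>"

# Hypothetical registry: one model per deployed dialect type.
DIALECT_MODELS = {d: DialectASRModel(d)
                  for d in ("mandarin", "sichuanese", "cantonese")}

def recognize(first_voice_info, dialect_type: str) -> str:
    # Fall back to a default model if the operator's dialect is not deployed.
    model = DIALECT_MODELS.get(dialect_type, DIALECT_MODELS["mandarin"])
    return model.recognize(first_voice_info)
```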
According to the speech recognition method, the voice information of the voice interaction device operator is collected based on the VAD cutoff time corresponding to that operator and recognized based on the dialect speech recognition model corresponding to the dialect type the operator uses. This effectively addresses operators who speak slowly or with a heavy dialect accent during voice interaction, improving speech recognition accuracy.
In an embodiment of the present application, the VAD cutoff time can be switched silently based on the voiceprint features of different operators, and different dialect speech recognition models can be triggered to improve the comprehension of voice instructions. As shown in fig. 2, the speech recognition method of the embodiment of the present application may include, but is not limited to, the following steps.
In step 201, collected second voice information of the voice interaction device operator is obtained, and voiceprint feature recognition is performed on it to obtain voiceprint feature information.
In an embodiment of the application, the second voice information can be understood as voice information collected while the operator's identity is still unknown; it is used to extract voiceprint feature information, which in turn determines who is operating the device. For example, the second voice information may be the wake-up word voice that wakes the voice interaction device.
For example, when the second voice information of the operator is obtained, it may be input into a voiceprint feature extraction model, whose output is the voiceprint feature information, i.e., the operator's voiceprint. Based on this, it can be determined which operator is performing voice interaction with the device.
In one implementation, the voiceprint feature extraction model may be trained in advance using training data so that it has the capability of extracting voiceprint features. The training may follow the related art and is not repeated here.
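A stand-in extractor to make the surrounding pipeline concrete; the patent only states that a pre-trained voiceprint feature extraction model is used, so a real system would run a trained speaker-embedding network here instead of this simple spectral fingerprint.

```python
import numpy as np

def extract_voiceprint(samples: np.ndarray, dim: int = 128) -> np.ndarray:
    """Stand-in voiceprint extractor: a normalized spectral fingerprint."""
    spectrum = np.abs(np.fft.rfft(samples, n=2 * dim))[:dim]
    return spectrum / (np.linalg.norm(spectrum) + 1e-9)
```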
In step 202, according to the voiceprint feature information, VAD cutoff time and dialect type corresponding to the voice interaction device operator are obtained.
In one implementation, the VAD cutoff time corresponding to the operator may be obtained, according to the voiceprint feature information, from a pre-established mapping between voiceprint features and VAD cutoff times.
In an embodiment of the present application, such a mapping may be established in advance; it may include one or more pieces of voiceprint feature information, each representing one operator. That is, the mapping can represent the correspondence between one or more operators and their VAD cutoff times.
For example, suppose three operators (operator 1, operator 2, and operator 3) share the same voice interaction device A. The mapping then includes: voiceprint feature information 11 of operator 1 and the corresponding VAD cutoff time 12, voiceprint feature information 21 of operator 2 and the corresponding VAD cutoff time 22, and voiceprint feature information 31 of operator 3 and the corresponding VAD cutoff time 32. For example, the VAD cutoff time for operator 1 is 600 ms, for operator 2 it is 1200 ms, and for operator 3 it is 1400 ms.
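A sketch of this mapping as a profile store, matched by voiceprint similarity; the store layout, the similarity threshold, and the dialect names are assumptions for illustration.

```python
import numpy as np

# Hypothetical profile store realizing the mapping in the example above.
PROFILES = [
    {"voiceprint": np.random.rand(128), "cutoff_ms": 600,  "dialect": "mandarin"},
    {"voiceprint": np.random.rand(128), "cutoff_ms": 1200, "dialect": "sichuanese"},
    {"voiceprint": np.random.rand(128), "cutoff_ms": 1400, "dialect": "cantonese"},
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_profile(voiceprint: np.ndarray, min_sim: float = 0.8):
    """Return the best-matching profile, or None for an unknown speaker."""
    best = max(PROFILES, key=lambda p: cosine(voiceprint, p["voiceprint"]))
    return best if cosine(voiceprint, best["voiceprint"]) >= min_sim else None
```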
In the embodiment of the present application, the VAD cutoff times in the mapping may be empirical values set on the basis of extensive experiments, or they may be predicted with a trained VAD cutoff time optimization model. The inputs of the model are the interval between the wake-up word and the instruction word and the intervals between the word segments of the instruction word; its output is a predicted VAD cutoff time.
It can be understood that, since an operator's speaking speed usually stays within a certain range over a certain period, the model can predict the operator's VAD cutoff time, and that time can be stored in the mapping. When the operator then performs voice interaction with the device during that period, the device collects the voice information the operator inputs based on the cutoff time in the mapping.
In one implementation, suppose the trained VAD cutoff time optimization model is deployed on the voice interaction device itself. After the operator's identity is determined, the cutoff time suited to the operator is predicted from the operator's voice input using the model; that is, a suitable VAD cutoff time can be predicted in real time, so the operator can finish the current interactive utterance within the cutoff period.
In the embodiment of the present application, the correspondence between voiceprint features and dialect types may also be established in advance. After obtaining the operator's voiceprint feature information, the operator's dialect type can be determined from this correspondence.
In step 203, first voice information of an operator of the voice interaction device is collected based on the VAD cutoff time.
In the embodiment of the present application, step 203 may be implemented by using any one of the embodiments of the present application, which is not limited in this embodiment and is not described again.
In step 204, the first speech information is recognized based on the dialect speech recognition model corresponding to the dialect type.
In the embodiment of the present application, step 204 may be implemented by using any one of the embodiments of the present application, which is not limited in this embodiment and is not described again.
According to the speech recognition method, the VAD cutoff time is switched silently based on the voiceprint features of different operators, and different dialect speech recognition models are triggered to improve the comprehension of voice instructions. The corresponding VAD cutoff time and dialect type are determined directly from the voiceprint of the voice the operator inputs, without any extra action by the operator, which simplifies both the operator's steps and the device flow.
It should be noted that, owing to physiological characteristics, an operator's speaking speed may change over time. To keep the pickup time personalized, the VAD cutoff time best suited to the operator over a recent period can be predicted at regular intervals, using the voice information from that period and the VAD cutoff time optimization model, and the newly predicted cutoff time can then be written back into the mapping. In some embodiments of the present application, as shown in fig. 3, the speech recognition method further includes the following steps.
In step 301, third voice information input by an operator of the voice interaction device during a voice interaction with the voice interaction device within a preset time period is obtained.
In an embodiment of the present application, the preset time period may be the past month, the past half year, or the like. For example, the voice information input by the operator during voice interaction with the device over the past half year may be obtained. The third voice information may include wake-up word voice and instruction voice.
In step 302, the third voice information is processed to obtain the first interval time between the wake-up word and the instruction word and the second interval times between the word segments of the instruction word.
Optionally, through word segmentation, the interval x between the wake-up word and the instruction word in the third voice information, and the intervals (y1, y2, ...) between the word segments of the instruction, can be calculated.
In step 303, the first interval time and the second interval time are input to the trained VAD cut-off time optimization model to obtain a VAD cut-off time prediction value.
In step 304, the VAD cut-off time corresponding to the voiceprint feature information in the mapping relationship is updated based on the VAD cut-off time predicted value.
It should be noted that, again because an operator's speaking speed may change over time, the VAD cutoff time optimization model may also be retrained at regular intervals so that it better satisfies the operator's usage in the corresponding period. In some embodiments of the present application, the model parameters of the VAD cutoff time optimization model may be adjusted using training samples generated from the third voice information. For an implementation of training the model, see the description of the subsequent embodiments.
According to the speech recognition method provided by this embodiment, the VAD cutoff time best suited to the operator over a recent period is predicted at regular intervals using the voice information from that period and the VAD cutoff time optimization model, and the newly predicted cutoff time is updated into the mapping, ensuring that the cutoff time stored there always matches the operator's pickup needs. A sketch of this periodic update follows.
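A sketch of steps 301-304, assuming the third voice information is available as timestamped word segments (word, start_ms, end_ms) with the wake-up word first, and assuming a model object with a predict() method; both assumptions are illustrative, not from the patent.

```python
def intervals(words):
    """words: (word, start_ms, end_ms) triples, wake-up word first."""
    gaps = [words[i + 1][1] - words[i][2] for i in range(len(words) - 1)]
    return gaps[0], gaps[1:]  # x (wake-up word -> instruction), (y1, y2, ...)

def refresh_cutoff(profile: dict, utterances: list, model) -> None:
    feats = [intervals(u) for u in utterances]   # steps 301-302
    predicted = model.predict(feats)             # step 303 (hypothetical API)
    profile["cutoff_ms"] = predicted             # step 304: update the mapping
```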
It should be noted that a voice interaction device is generally also provided with an image collection device, which can be used to determine the operator's identification information and hence the corresponding VAD cutoff time and dialect type. As shown in fig. 4, the speech recognition method may include, but is not limited to, the following steps.
In step 401, collected face image information of the voice interaction device operator is obtained.
In step 402, face feature recognition is performed on the face image information to obtain face feature information.
Optionally, a pre-established face feature extraction model may be used to extract the face features from the face image information, yielding the face feature information of the voice interaction device operator.
In step 403, the voice activity detection (VAD) cutoff time and dialect type corresponding to the operator are obtained according to the face feature information.
Optionally, a correspondence between face features and VAD cutoff times and a correspondence between face features and dialect types are established in advance; based on these two correspondences and the operator's face feature information, the VAD cutoff time and dialect type corresponding to the operator can be determined.
In step 404, first voice information of an operator of the voice interaction device is collected based on the VAD cutoff time.
In the embodiment of the present application, step 404 may be implemented by using any one of the embodiments of the present application, which is not limited in this embodiment and is not described again.
In step 405, the first speech information is recognized based on a dialect speech recognition model corresponding to the dialect type.
In the embodiment of the present application, step 405 may be implemented by using any one of the embodiments of the present application, which is not limited in this embodiment and is not described again.
According to this speech recognition method, the pickup time is switched silently based on face features and different dialect speech recognition models are triggered, improving the comprehension of operator instructions. When a device is shared by multiple people, the problems that a specific operator (such as an elderly user) speaks slowly and with a heavy dialect accent during voice interaction can be effectively addressed, improving speech recognition accuracy.
An embodiment of the present application further provides a control method for a voice interaction device.
Fig. 5 is a flowchart of a method for controlling a voice interaction device according to an embodiment of the present application. It should be noted that the control method of the voice interaction device in the embodiment of the present application is applicable to the control apparatus of the voice interaction device in the embodiment of the present application, and the control apparatus may be configured on an electronic device, for example, the electronic device may be a voice interaction device. As shown in fig. 5, the control method of the voice interactive apparatus may include, but is not limited to, the following steps.
In step 501, collected voice information of an operator of a voice interaction device is obtained.
In the embodiment of the present application, step 501 may be implemented by any one of the embodiments of the present application, which is not limited in this embodiment and is not described again.
In step 502, voiceprint feature recognition is performed on the voice information to obtain voiceprint feature information.
In the embodiment of the present application, step 502 may be implemented by any one of the embodiments of the present application, which is not limited in this embodiment and is not described again.
In step 503, the voice interaction device operator is determined to be a specific operator based on the voiceprint feature information, and the voice activity detection (VAD) cutoff time and dialect type corresponding to the specific operator are obtained according to the voiceprint feature information.
In an embodiment of the application, the specific operator may be an elderly user.
In the embodiment of the present application, step 503 may be implemented by using any one of the embodiments of the present application, which is not limited in this embodiment and is not described again.
In step 504, the voice interaction device is controlled to perform voice interaction with a specific operator based on the VAD cutoff time and the dialect type.
Optionally, voice information of the specific operator is collected based on the VAD cutoff time and recognized based on the dialect speech recognition model corresponding to the dialect type, as in the sketch below.
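A control-flow sketch of steps 501-504 that chains the hypothetical helpers from the earlier sketches (extract_voiceprint, match_profile, pick_up, recognize); none of these names come from the patent itself.

```python
def handle_interaction(wake_audio, mic_frames, mic_flags):
    voiceprint = extract_voiceprint(wake_audio)       # step 502
    profile = match_profile(voiceprint)               # step 503
    if profile is None:
        return None  # not a registered specific operator
    # Step 504: collect with the operator's cutoff, then dialect-specific ASR.
    frames = pick_up(mic_frames, mic_flags, profile["cutoff_ms"])
    return recognize(frames, profile["dialect"])
```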
According to the control method provided by this embodiment, when a voice interaction device is shared by multiple people, the problems that a specific operator speaks slowly and with a heavy dialect accent during voice interaction can be effectively addressed, improving speech recognition accuracy and hence the voice interaction experience.
In an embodiment of the present application, the VAD cutoff time may be obtained using a VAD cutoff time optimization model, which can be trained in advance. In some embodiments of the present application, as shown in fig. 6, the control method of the voice interaction device may further include:
Step 601, obtaining input voice information of the specific operator during voice interaction with the voice interaction device within a preset time period.
Step 602, generating a training sample from the input voice information, the training sample comprising the first interval time between the wake-up word and the instruction word, the second interval times between the word segments of the instruction word, and the pickup time of the input voice information.
Step 603, inputting the first interval time and the second interval times into a preset VAD truncation time optimization model to obtain a predicted VAD truncation time.
Step 604, training the VAD truncation time optimization model based on the pickup time and the predicted VAD truncation time.
For example, after the voice interaction device is first started, the family members of the household where it is located (elderly, children, young adults, and so on) are presented for selection. When an elderly member is selected and voice recognition model optimization is chosen, the device enters the sound collection phase. Sound collection phase: supposing an elderly user will use the device, the user reads out the questions shown on the screen one by one, following the on-screen prompts, at his or her everyday speaking speed. The user may also freely speak a dialogue instruction longer than a certain number of words.
Recording the sound profile: after the elderly user's voice collection is finished, the dialect type and natural speaking speed of the user's voice are judged, and the user's voiceprint feature information is recorded as the user's sound signature.
Dialect speech recognition model training: in the dialect optimization stage, by distinguishing elderly users' dialect types, their corpora are fed into different dialect speech recognition models for dialect fuzzy-tone training, yielding dialect speech recognition models suited to those users.
It should be noted that the VAD cutoff time optimization model may also be pre-trained. Training set: elderly users can be recruited in batches to use voice interaction devices under natural conditions, with the same fixed daily usage routine. The pickup time of this batch of devices is adjusted in the background, sweeping a pickup interval from 1 s below to 2 s above the normal pickup time. During an elderly user's single round of use, recordings are made while stepping the pickup time by 200 ms across this interval, giving the per-round pickup times (z1, z2, ..., z15). After the training samples are recorded, word segmentation is used to calculate the interval x between the wake-up word and the instruction word and the intervals (y1, y2, ...) between the word segments of the instruction. Meanwhile, how well each instruction is understood is evaluated and recorded through a manual GSB rating (G: good case, S: simple, B: bad case).
Model training: the GSB ratings are aggregated together with the corresponding conditions x and y, and an unsupervised clustering algorithm yields the GSB outcome for each combination of x, y, and z. The samples clustered as G or S are used as learning samples, with G weighted higher than S. This yields the relatively most suitable truncation time z for each speech condition; a simplified stand-in for this selection is sketched below.
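The patent describes unsupervised clustering over (x, y, z, GSB) records; the weighted-vote stand-in below illustrates only the G-over-S weighting and the selection of z, with the record layout and weights assumed for illustration.

```python
from collections import defaultdict

def best_cutoff(records, g_weight=2.0, s_weight=1.0):
    """records: dicts like {"x": 300, "ys": [120, 90], "z": 1200, "label": "G"}.
    B-labelled samples are discarded; G counts more than S; the truncation
    time z with the highest weighted support wins."""
    votes = defaultdict(float)
    for r in records:
        if r["label"] == "G":
            votes[r["z"]] += g_weight
        elif r["label"] == "S":
            votes[r["z"]] += s_weight
    return max(votes, key=votes.get) if votes else None
```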
Data collection and model evaluation: after an elderly user's voice collection is finished, the same word segmentation is applied to calculate the user's wake-up-word-to-instruction interval x and the set of inter-segment intervals y. These are fed into the model, and the pickup truncation time z is calculated against the G/S standard.
Model correction: after the suitable cutoff time for the elderly user is calculated, it takes effect for that user for the current period. Meanwhile, during daily use, the x and y times of successful voice interactions are recalculated at intervals. Through this long-term correction of the truncation time z, the user's voice interaction experience is kept in a good state and falsely truncated utterances (bad cases) are effectively reduced.
The actual voice interaction stage may include a user identification flow, a VAD cutoff time replacement flow, and a dialect speech recognition model switching flow. User identification: after the user utters a wake-up word command to the device, the device determines who the user is through voiceprint analysis and matching; if the user is an elderly user with personalized voice optimization configured, the voice optimization module is entered to switch to the personalized settings. VAD cutoff time replacement: once the interacting user is identified by voiceprint, the device's VAD cutoff time is automatically switched to that user's time, optimizing pickup for the current round of voice interaction. Dialect model switching: once the interacting user is identified by voiceprint, the speech recognition model is replaced with the dialect speech recognition model for that user, further optimizing dialect recognition.
An embodiment of the present application further provides a truncation time optimization model training method; the truncation time optimization model is used to predict the VAD truncation time length in voice interaction scenarios and is the VAD truncation time optimization model referred to above. As shown in fig. 7, the VAD truncation time optimization model training method may include:
Step 701, obtaining input voice information of a sample user during voice interaction with a voice interaction device.
Step 702, generating a training sample from the input voice information, the training sample comprising the first interval time between the wake-up word and the instruction word, the second interval times between the word segments of the instruction word, and the pickup time of the input voice information.
Step 703, inputting the first interval time and the second interval times into the initial model to obtain a predicted VAD truncation time.
Step 704, training the initial model based on the pickup time and the predicted VAD truncation time, obtaining model parameters, and generating the VAD truncation time optimization model based on the model parameters.
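A minimal training sketch for steps 701-704 under stated assumptions: the patent does not name the model family, so a least-squares linear model stands in for the initial model, and the feature summary of the interval times is a hand-rolled choice.

```python
import numpy as np

def featurize(x_ms: float, ys_ms) -> np.ndarray:
    """Bias term plus the wake-up-word gap x and summaries of the gaps y."""
    ys = list(ys_ms) or [0.0]
    return np.array([1.0, x_ms, max(ys), float(np.mean(ys))])

def train_cutoff_model(samples) -> np.ndarray:
    """samples: (x_ms, ys_ms, pickup_ms) triples (steps 701-702).
    Least squares minimizes the squared error between the predicted
    truncation time (step 703) and the recorded pickup time (step 704)."""
    samples = list(samples)
    X = np.stack([featurize(x, ys) for x, ys, _ in samples])
    z = np.array([p for _, _, p in samples], dtype=float)
    w, *_ = np.linalg.lstsq(X, z, rcond=None)
    return w  # the learned model parameters of step 704

def predict_cutoff(w: np.ndarray, x_ms: float, ys_ms) -> float:
    return float(featurize(x_ms, ys_ms) @ w)
```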
For an implementation of the VAD truncation time optimization model training method according to the embodiment of the present application, reference may be made to the description of the implementation of the embodiment shown in fig. 6, and details are not described here again.
By implementing the embodiments of the application, a model capable of predicting the VAD cutoff time can be trained, so that the cutoff time best suited to a voice interaction device operator can be predicted with it. This effectively addresses specific operators who speak slowly during voice interaction and ensures that the device picks up their speech completely.
The application further provides a speech recognition apparatus. Fig. 8 is a block diagram of a speech recognition apparatus according to an embodiment of the present application. As shown in fig. 8, the speech recognition apparatus may include: a first obtaining module 801, a collection module 802, and a recognition module 803.
The first obtaining module 801 is configured to obtain the voice activity detection (VAD) cutoff time and dialect type corresponding to a voice interaction device operator based on identification information of the operator. In some embodiments of the present application, the first obtaining module 801 obtains collected second voice information of the operator, performs voiceprint feature recognition on the second voice information to obtain voiceprint feature information, and obtains the VAD cutoff time and dialect type corresponding to the operator according to the voiceprint feature information.
In one implementation, the first obtaining module 801 may obtain, according to the voiceprint feature information, the VAD cutoff time corresponding to the operator from a pre-established mapping between voiceprint features and VAD cutoff times.
In another implementation, the first obtaining module 801 may obtain collected face image information of the operator, perform face feature recognition on it to obtain face feature information, and obtain the voice activity detection (VAD) cutoff time and dialect type corresponding to the operator according to the face feature information.
The collection module 802 is configured to collect the first voice information of the voice interaction device operator based on the VAD cutoff time.
The recognition module 803 is configured to recognize the first speech information based on a dialect speech recognition model corresponding to the dialect type.
According to the speech recognition apparatus, the voice information of the voice interaction device operator is collected based on the VAD cutoff time corresponding to that operator and recognized based on the dialect speech recognition model corresponding to the dialect type the operator uses. This effectively addresses operators who speak slowly or with a heavy dialect accent during voice interaction, improving speech recognition accuracy.
It should be noted that, owing to physiological characteristics, an operator's speaking speed may change over time. To keep the pickup time personalized, the VAD cutoff time best suited to the operator over a recent period can be predicted at regular intervals, using the voice information from that period and the VAD cutoff time optimization model, and the newly predicted cutoff time can then be written back into the mapping. In some embodiments of the present application, as shown in fig. 9, on the basis of the above embodiments, the speech recognition apparatus may further include: a second obtaining module 904, a speech processing module 905, a prediction module 906, and an updating module 907.
The second obtaining module 904 is configured to obtain third voice information input by the voice interaction device operator during voice interaction with the device within a preset time period; the speech processing module 905 is configured to process the third voice information to obtain the first interval time between the wake-up word and the instruction word and the second interval times between the word segments of the instruction word; the prediction module 906 is configured to input the first interval time and the second interval times into the trained VAD cutoff time optimization model to obtain a predicted VAD cutoff time; and the updating module 907 is configured to update the VAD cutoff time corresponding to the voiceprint feature information in the mapping based on the predicted VAD cutoff time. In an embodiment of the application, the model parameters of the VAD cutoff time optimization model are adjusted using training samples generated from the third voice information.
Modules 901-903 in fig. 9 have the same functions and structures as modules 801-803 in fig. 8.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 10 is a block diagram of a control device of a voice interaction device according to an embodiment of the present application. As shown in fig. 10, the control device may include: a first obtaining module 1001, a recognition module 1002, a second obtaining module 1003 and a control module 1004.
The first obtaining module 1001 is configured to obtain the collected voice information of the operator of the voice interaction device.
The recognition module 1002 is configured to perform voiceprint feature recognition on the voice information to obtain voiceprint feature information.
The second obtaining module 1003 is configured to determine, based on the voiceprint feature information, that the voice interaction device operator is a specific operator, and to obtain the voice activity detection (VAD) cutoff time and dialect type corresponding to the specific operator according to the voiceprint feature information.
The control module 1004 is configured to control the voice interaction device to perform voice interaction with a particular operator based on the VAD cutoff time and the dialect type.
According to the control device provided by the embodiment of the application, the slow speech and heavy dialect accent of a specific operator's voice interaction during multi-person sharing of the voice interaction device can be effectively addressed, improving speech recognition accuracy and, in turn, the voice interaction experience.
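The control flow of fig. 10 can likewise be sketched briefly; the voiceprint matcher, profile store, and device methods below are hypothetical stand-ins introduced for this example.

```python
# A sketch of the fig. 10 control flow. The matcher, profile store, and
# device API are assumed names, not the disclosed implementation.

def control_voice_interaction(audio, extract_voiceprint, voiceprint_db,
                              profile_store, device):
    # Perform voiceprint feature recognition on the collected speech.
    features = extract_voiceprint(audio)

    # Determine whether the speaker is a registered specific operator.
    operator_id = voiceprint_db.match(features)
    if operator_id is None:
        device.interact_with_defaults(audio)  # not a specific operator: default settings
        return

    # Configure the device with the specific operator's VAD cut-off time
    # and dialect type, then carry out the voice interaction.
    profile = profile_store[operator_id]
    device.set_vad_cutoff(profile["vad_cutoff_s"])
    device.set_dialect(profile["dialect_type"])
    device.interact(audio)
```

Keying the lookup by voiceprint rather than by device session is what lets a shared device switch profiles as different users speak.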
It should be noted that the VAD cut-off time can be obtained using a VAD cut-off time optimization model, which can be trained in advance. In some embodiments of the present application, as shown in fig. 11, on the basis of the above embodiments, the control device may further include: a third obtaining module 1105, a generating module 1106, a prediction module 1107, and a training module 1108.
The third obtaining module 1105 is configured to obtain recorded voice information of the specific operator during voice interaction with the voice interaction device within a preset time period. The generating module 1106 is configured to generate a training sample from the recorded voice information, where the training sample includes a first interval time between the wakeup word and the instruction word, a second interval time between each participle in the instruction word, and the pickup time of the recorded voice information. The prediction module 1107 is configured to input the first interval time and the second interval time into a preset VAD cut-off time optimization model to obtain a VAD cut-off time prediction value. The training module 1108 is configured to train the VAD cut-off time optimization model based on the pickup time and the VAD cut-off time prediction value. Modules 1101-1104 in fig. 11 have the same functions and structures as modules 1001-1004 in fig. 10.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 12 is a block diagram of a cut-off time optimization model training apparatus according to an embodiment of the present application. The cut-off time optimization model is used for predicting the voice activation detection VAD cut-off time length in a voice interaction scenario. As shown in fig. 12, the cut-off time optimization model training apparatus may include: an acquisition module 1201, a generation module 1202, a prediction module 1203, and a training module 1204.
The obtaining module 1201 is configured to obtain input voice information of a sample user in a voice interaction process with a voice interaction device.
The generating module 1202 is configured to generate a training sample according to the input voice information, where the training sample includes a first interval time between the wakeup word and the instruction word, a second interval time between each participle in the instruction word, and a pickup time of the input voice information.
The prediction module 1203 is configured to input the first interval time between the wakeup word and the instruction word and the second interval time between each participle in the instruction word into the initial model to obtain a VAD cut-off time prediction value.
The training module 1204 is configured to train the initial model based on the pickup time and the VAD cut-off time prediction value to obtain model parameters, and to generate the VAD cut-off time optimization model from those parameters.
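Since the disclosure does not fix a concrete model architecture, the following sketch treats the cut-off time optimization model as a simple least-squares regression from the interval features to the pickup time; the feature layout and the NumPy-based fitting are assumptions made purely for illustration.

```python
import numpy as np

# Illustrative training loop for the cut-off time optimization model.
# The linear-regression form and feature layout are assumptions.

def build_features(first_gap_s, word_gaps_s):
    # One row per utterance: wakeup-word-to-instruction gap, plus the mean
    # and max gap between adjacent participles of the instruction word.
    gaps = word_gaps_s or [0.0]
    return [first_gap_s, float(np.mean(gaps)), float(np.max(gaps))]

def train_cutoff_model(samples):
    # samples: list of (first_gap_s, word_gaps_s, pickup_time_s) tuples,
    # where the pickup time acts as the supervision target.
    X = np.array([build_features(f, w) for f, w, _ in samples])
    y = np.array([t for _, _, t in samples])

    # Fit by least squares; these are the "model parameters" from which
    # the VAD cut-off time optimization model is generated.
    X1 = np.hstack([X, np.ones((len(X), 1))])  # add bias column
    theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return theta

def predict_cutoff(theta, first_gap_s, word_gaps_s):
    x = np.array(build_features(first_gap_s, word_gaps_s) + [1.0])
    return float(x @ theta)
```

Under this reading, the pickup time of each recorded utterance is the supervision target, so the fitted parameters directly yield a cut-off time prediction for new interval measurements.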
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the training method of the embodiment of the application, a model capable of predicting the VAD cut-off time can be trained, so that the optimal VAD cut-off time for a voice interaction device operator can be predicted with the model. This effectively solves the problem of a specific operator speaking slowly during voice interaction and ensures that the operator's speech is picked up completely by the voice interaction device.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 13 is a block diagram of an electronic device according to an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, smart speakers, voice interaction devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 13, the electronic device includes: one or more processors 1301, a memory 1302, and interfaces for connecting the components, including high-speed and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). One processor 1301 is taken as an example in fig. 13.
The memory 1302 is a non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor executes any one or more of the speech recognition method, the control method of a voice interaction device, and the cut-off time optimization model training method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute any one or more of those methods.
As a non-transitory computer-readable storage medium, the memory 1302 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in any of the above embodiments of the present application. The processor 1301 executes the non-transitory software programs, instructions, and modules stored in the memory 1302 to perform the various functional applications and data processing of the server, that is, to implement any one or more of the speech recognition method, the control method of a voice interaction device, and the cut-off time optimization model training method in the above method embodiments.
The memory 1302 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 1302 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 1302 may optionally include memory located remotely from the processor 1301, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include an input device 1303 and an output device 1304. The processor 1301, the memory 1302, the input device 1303 and the output device 1304 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 13.
The input device 1303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 1304 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; the present application is not limited in this regard, as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (21)

1. A speech recognition method comprising:
acquiring voice activation detection VAD cut-off time and dialect types corresponding to voice interaction equipment operators based on identification information of the voice interaction equipment operators;
acquiring first voice information of the voice interaction equipment operator based on the VAD cut-off time;
and recognizing the first voice information based on a dialect voice recognition model corresponding to the dialect type.
2. The method according to claim 1, wherein the obtaining a voice activation detection VAD cutoff time and a dialect type corresponding to the voice interaction device operator based on identification information of the voice interaction device operator comprises:
acquiring collected second voice information of the voice interaction equipment operator, and performing voiceprint feature recognition on the second voice information to obtain voiceprint feature information;
and obtaining the VAD cut-off time and dialect type corresponding to the voice interaction equipment operator according to the voiceprint feature information.
3. The method of claim 2, wherein obtaining a VAD cutoff time corresponding to the voice interaction device operator based on the voiceprint feature information comprises:
and acquiring VAD cut-off time corresponding to the voice interaction equipment operator from a mapping relation between preset voiceprint features and VAD cut-off time according to the voiceprint feature information.
4. The method of claim 3, further comprising:
acquiring third voice information input by the voice interaction equipment operator in the voice interaction process with the voice interaction equipment within a preset time period;
processing the third voice information to obtain first interval time between a wakeup word and an instruction word and second interval time between each participle in the instruction word;
inputting the first interval time and the second interval time into a trained VAD cut-off time optimization model to obtain a VAD cut-off time predicted value;
and updating the VAD cut-off time corresponding to the voiceprint characteristic information in the mapping relation based on the VAD cut-off time predicted value.
5. The method according to claim 4, wherein the VAD cut-off time optimization model makes model parameter adjustments through training samples generated from the third speech information.
6. The method according to claim 1, wherein the obtaining a voice activation detection VAD cutoff time and a dialect type corresponding to the voice interaction device operator based on identification information of the voice interaction device operator comprises:
acquiring collected face image information of the voice interaction equipment operator;
carrying out face feature recognition on the face image information to obtain face feature information;
and acquiring the voice activation detection VAD cut-off time and dialect type corresponding to the voice interaction equipment operator according to the face feature information.
7. A control method of voice interaction equipment comprises the following steps:
acquiring collected voice information of an operator of the voice interaction equipment;
performing voiceprint feature recognition on the voice information to obtain voiceprint feature information;
determining that the operator of the voice interaction equipment is a specific operator based on the voiceprint characteristic information, and acquiring the voice activation detection VAD cut-off time and dialect type corresponding to the specific operator according to the voiceprint characteristic information;
controlling the voice interaction device to perform voice interaction with the specific operator based on the VAD cutoff time and the dialect type.
8. The method of claim 7, further comprising:
acquiring recorded voice information of the specific operator in the voice interaction process with the voice interaction equipment within a preset time period;
generating a training sample according to the recorded voice information, wherein the training sample comprises a first interval time between a wakeup word and an instruction word, a second interval time between each participle in the instruction word, and a pickup time of the recorded voice information;
inputting the first interval time between the wakeup word and the instruction word and the second interval time between each participle in the instruction word into a preset VAD cut-off time optimization model to obtain a VAD cut-off time predicted value;
and training the VAD cut-off time optimization model based on the pickup time and the VAD cut-off time predicted value.
9. A cut-off time optimization model training method, wherein the cut-off time optimization model is used for predicting a VAD cut-off time length in a voice interaction scenario, and the method comprises:
acquiring input voice information of a sample user in a voice interaction process with voice interaction equipment;
generating a training sample according to the input voice information, wherein the training sample comprises a first interval time between a wakeup word and an instruction word, a second interval time between each participle in the instruction word, and a pickup time of the input voice information;
inputting the first interval time between the wakeup word and the instruction word and the second interval time between each participle in the instruction word into an initial model to obtain a VAD cut-off time predicted value;
and training the initial model based on the pickup time and the VAD cut-off time predicted value to obtain model parameters, and generating the VAD cut-off time optimization model based on the model parameters.
10. A speech recognition apparatus comprising:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring the voice activation detection VAD cut-off time and dialect type corresponding to a voice interaction device operator based on the identification information of the voice interaction device operator;
the acquisition module is used for acquiring first voice information of the operator of the voice interaction equipment based on the VAD cut-off time;
and the recognition module is used for recognizing the first voice information based on a dialect voice recognition model corresponding to the dialect type.
11. The apparatus of claim 10, wherein the first obtaining module is specifically configured to:
acquiring collected second voice information of the voice interaction equipment operator, and performing voiceprint feature recognition on the second voice information to obtain voiceprint feature information;
and obtaining the VAD cut-off time and dialect type corresponding to the voice interaction equipment operator according to the voiceprint feature information.
12. The apparatus of claim 11, wherein the first obtaining module is specifically configured to:
and acquiring VAD cut-off time corresponding to the voice interaction equipment operator from a mapping relation between preset voiceprint features and VAD cut-off time according to the voiceprint feature information.
13. The apparatus of claim 12, further comprising:
the second acquisition module is used for acquiring third voice information input by the voice interaction equipment operator in the voice interaction process with the voice interaction equipment within a preset time period;
the voice processing module is used for processing the third voice information to obtain first interval time between a wakeup word and an instruction word and second interval time between each participle in the instruction word;
the prediction module is used for inputting the first interval time and the second interval time into a trained VAD cut-off time optimization model to obtain a VAD cut-off time prediction value;
and the updating module is used for updating the VAD cut-off time corresponding to the voiceprint characteristic information in the mapping relation based on the VAD cut-off time predicted value.
14. The apparatus according to claim 13, wherein the VAD cut-off time optimization model makes model parameter adjustments through training samples generated from the third speech information.
15. The apparatus of claim 10, wherein the first obtaining module is specifically configured to:
acquiring collected face image information of the voice interaction equipment operator;
carrying out face feature recognition on the face image information to obtain face feature information;
and acquiring the voice activation detection VAD cut-off time and dialect type corresponding to the voice interaction equipment operator according to the face feature information.
16. A control apparatus of a voice interaction device, comprising:
the first acquisition module is used for acquiring collected voice information of the voice interaction equipment operator;
the recognition module is used for carrying out voiceprint feature recognition on the voice information to obtain voiceprint feature information;
the second acquisition module is used for determining that the voice interaction equipment operator is a specific operator based on the voiceprint characteristic information, and acquiring the voice activation detection VAD cut-off time and dialect type corresponding to the specific operator according to the voiceprint characteristic information;
and the control module is used for controlling the voice interaction equipment to carry out voice interaction with the specific operator based on the VAD cut-off time and the dialect type.
17. The apparatus of claim 16, further comprising:
the third acquisition module is used for acquiring the recorded voice information of the specific operator in the voice interaction process with the voice interaction equipment within a preset time period;
the generating module is used for generating a training sample according to the recorded voice information, wherein the training sample comprises a first interval time between a wakeup word and an instruction word, a second interval time between each participle in the instruction word, and a pickup time of the recorded voice information;
the prediction module is used for inputting the first interval time between the wakeup word and the instruction word and the second interval time between each participle in the instruction word into a preset VAD cut-off time optimization model to obtain a VAD cut-off time predicted value;
and the training module is used for training the VAD cut-off time optimization model based on the pickup time and the VAD cut-off time predicted value.
18. A cut-off time optimization model training apparatus, wherein the cut-off time optimization model is used for predicting a VAD cut-off time length in a voice interaction scenario, the apparatus comprising:
the acquisition module is used for acquiring the recorded voice information of the sample user in the voice interaction process with the voice interaction equipment;
the generating module is used for generating a training sample according to the recorded voice information, wherein the training sample comprises a first interval time between a wakeup word and an instruction word, a second interval time between each participle in the instruction word, and a pickup time of the recorded voice information;
the prediction module is used for inputting the first interval time between the wakeup word and the instruction word and the second interval time between each participle in the instruction word into an initial model to obtain a VAD cut-off time predicted value;
and the training module is used for training the initial model based on the pickup time and the VAD cut-off time predicted value, obtaining model parameters and generating the VAD cut-off time optimization model based on the model parameters.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 9.
21. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 9.
CN202111513576.4A 2021-12-10 2021-12-10 Speech recognition method, control method, model training method and device thereof Pending CN114360531A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111513576.4A CN114360531A (en) 2021-12-10 2021-12-10 Speech recognition method, control method, model training method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111513576.4A CN114360531A (en) 2021-12-10 2021-12-10 Speech recognition method, control method, model training method and device thereof

Publications (1)

Publication Number Publication Date
CN114360531A true CN114360531A (en) 2022-04-15

Family

ID=81099370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111513576.4A Pending CN114360531A (en) 2021-12-10 2021-12-10 Speech recognition method, control method, model training method and device thereof

Country Status (1)

Country Link
CN (1) CN114360531A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713936A (en) * 2022-10-21 2023-02-24 广州视声智能股份有限公司 Voice control method and device based on smart home
CN115798465A (en) * 2023-02-07 2023-03-14 天创光电工程有限公司 Voice input method, system and readable storage medium
CN115798465B (en) * 2023-02-07 2023-04-07 天创光电工程有限公司 Voice input method, system and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination