CN110503943B - Voice interaction method and voice interaction system - Google Patents
Voice interaction method and voice interaction system Download PDFInfo
- Publication number
- CN110503943B (publication) · CN201810473045.9A / CN201810473045A (application)
- Authority
- CN
- China
- Prior art keywords
- voice
- information
- gender
- segment
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000003993 interaction Effects 0.000 title claims abstract description 59
- 238000000034 method Methods 0.000 title claims abstract description 30
- 238000007781 pre-processing Methods 0.000 claims abstract description 35
- 238000007499 fusion processing Methods 0.000 claims abstract description 9
- 238000012549 training Methods 0.000 claims description 47
- 230000015654 memory Effects 0.000 claims description 32
- 238000001514 detection method Methods 0.000 claims description 25
- 238000002372 labelling Methods 0.000 claims description 11
- 238000012545 processing Methods 0.000 claims description 10
- 238000004364 calculation method Methods 0.000 claims description 6
- 230000002452 interceptive effect Effects 0.000 claims description 5
- 239000000284 extract Substances 0.000 claims description 3
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000000605 extraction Methods 0.000 description 3
- 230000000306 recurrent effect Effects 0.000 description 3
- 238000010276 construction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000007935 neutral effect Effects 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 238000013145 classification model Methods 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 230000007787 long-term memory Effects 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The application relates to a voice interaction method and a voice interaction system. The method comprises the following steps: a preprocessing step of preprocessing input voice information and outputting a voice segment; a semantic recognition step of performing semantic recognition on the voice segment output by the preprocessing step and outputting semantic information; a gender classification step of recognizing the user's gender from the voice segment output by the preprocessing step and outputting gender information; and a fusion processing step of fusing the gender information and the semantic information to obtain a personalized reply to the voice information. With the voice interaction method and voice interaction system of the application, replies can be differentiated according to the user's gender, improving the user experience and making voice interaction more intelligent.
Description
Technical Field
The present application relates to a voice recognition technology, and more particularly, to a voice interaction method and a voice interaction system capable of recognizing gender of a user.
Background
In vehicle-mounted dialogue systems, existing voice recognition technology can recognize a user's speech to a certain extent, but some topics depend on the user's gender, and existing technology often has difficulty giving a gender-appropriate answer from the recognized text alone.
The information disclosed in the background section of the application is only for enhancement of understanding of the general background of the application and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present application aims to provide a voice interaction method and a voice interaction system capable of recognizing the gender of a user.
The voice interaction method of the application is characterized by comprising the following steps:
a preprocessing step of preprocessing the input voice information and outputting a voice segment;
a semantic recognition step, namely performing semantic recognition on the voice segment output by the preprocessing step and outputting semantic information;
a gender classification step, namely recognizing the gender of the user from the voice segment output by the preprocessing step and outputting gender information; and
and a fusion processing step of fusing the gender information and the semantic information to obtain personalized reply information for the voice information.
Optionally, the gender classification step includes:
a model training sub-step of training a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information; and
a gender classification sub-step of inputting the voice segment into the trained LSTM model and outputting a gender classification.
Optionally, in the preprocessing step, an endpoint detection algorithm is used to detect a speech segment for the input speech information.
Optionally, in the preprocessing step, for the input voice information, an endpoint detection algorithm is used to detect voice segments, outputting a first voice segment provided to the semantic recognition step and a second voice segment provided to the gender classification step, wherein the endpoint detection boundary of the second voice segment is stricter than that of the first voice segment.
Optionally, the model training sub-step comprises:
preparing a gender-labeled training set;
extracting the filter-bank output acoustic features of the training set;
constructing a label file corresponding to the filter-bank features; and
inputting the filter-bank features and the label file into the LSTM model for training until the model converges.
Optionally, the gender classification sub-step includes:
inputting the voice segment into the trained LSTM model;
performing forward computation to obtain posterior probabilities of the gender classes; and
accumulating the posterior probabilities over a predetermined period to obtain the gender classification result.
The voice interaction system of the present application is characterized by comprising:
the preprocessing module is used for preprocessing the input voice information and outputting voice segments;
the semantic recognition module is used for carrying out semantic recognition on the voice segment output by the preprocessing module and outputting semantic information;
the gender classification module is used for classifying the gender of the voice segment output by the preprocessing module, identifying the gender of the user and outputting gender information; and
and the fusion processing module is used for fusing the gender information and the semantic information to obtain personalized reply information for the voice information.
Optionally, the gender classification module includes:
the model training sub-module is used for training a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information; and
the gender classification sub-module is used for inputting the voice segment into the trained LSTM model and outputting a gender classification.
Optionally, in the preprocessing module, for the input voice information, an endpoint detection algorithm is used to detect a voice segment.
Optionally, the preprocessing module performs voice segment detection on the input voice information using an endpoint detection algorithm and outputs a first voice segment provided to the semantic recognition module and a second voice segment provided to the gender classification module,
wherein the end-point detection boundary of the second speech segment is more stringent than the end-point detection boundary of the first speech segment.
Optionally, the model training sub-module extracts the filter-bank output acoustic features of a gender-labeled training set, constructs a label file corresponding to those features, and inputs the features and the label file into the LSTM model for training until the model converges.
Optionally, the gender classification sub-module inputs the voice segment into the trained LSTM model, obtains the posterior probabilities of the gender classes through forward computation, and accumulates the posterior probabilities over a specified period to obtain the gender classification result.
The voice interaction method of the application is applied to a vehicle, or the voice interaction system of the application is applied to a vehicle.
The application also provides voice interaction equipment which can execute the voice interaction method or comprises the voice interaction system.
Optionally, the voice interaction device is disposed on a vehicle.
The application provides a controller comprising a storage component, a processing component, and instructions stored on the storage component and executable by the processing component, wherein the processing component implements the above voice interaction method when executing the instructions. According to the voice interaction method and voice interaction system of the application, by combining semantic analysis with gender classification, replies can be differentiated according to the user's gender, improving the user experience and making voice interaction more intelligent.
Other features and advantages of the methods and apparatus of the present application will become apparent from the accompanying drawings and the following detailed description, which together serve to illustrate certain principles of the application.
Drawings
Fig. 1 is a flowchart showing a voice interaction method according to an embodiment of the present application.
Fig. 2 is a schematic illustration of a specific flow of the gender classification step.
Fig. 3 is a block diagram showing the construction of a voice interaction system according to an embodiment of the present application.
Detailed Description
The following presents a simplified summary of the application in order to provide a basic understanding of the application. It is not intended to identify key or critical elements of the application or to delineate the scope of the application.
First, some terms that will appear hereinafter will be explained.
NLU: natural language understanding;
ASR: automatic speech recognition;
LSTM (long short-term memory) model: a deep-learning recurrent model capable of learning long-range dependencies;
fbank features: filter-bank feature parameters of an audio file;
CMVN: cepstral mean and variance normalization statistics of the feature files;
GMM-HMM: a conventional acoustic model, namely a hidden Markov model with Gaussian-mixture observation densities.
Fig. 1 is a flowchart of a voice interaction method according to an embodiment of the present application.
Referring to fig. 1, the voice interaction method according to an embodiment of the present application includes the following steps:
input step S100: inputting voice information;
preprocessing step S200: preprocessing the voice information input in the input step S100 and outputting voice segments;
semantic recognition step S300: carrying out semantic recognition on the voice segment output by the preprocessing step S200 and outputting semantic information;
gender classification step S400: performing gender classification on the voice segment output by the preprocessing step S200, identifying the gender of the user and outputting gender information;
fusion processing step S500: fusing the gender information and the semantic information to obtain personalized reply information for the input voice information; and
output step S600: outputting the personalized reply information, for example as speech or as text.
Next, the preprocessing step S200, the gender classification step S400, and the fusion processing step S500 are explained by way of example. In the semantic recognition step S300, semantic recognition of the voice segment and output of semantic information may be performed with the same technical means as in conventional systems, so its description is omitted.
As an example, in the preprocessing step S200, an endpoint detection (voice activity detection, VAD) algorithm is applied to the input voice information to obtain voice segments. For example, the user's voice information is fed into a VAD model, which obtains the voice segments through endpoint detection, feature extraction, and similar processing. The obtained voice segments are provided to the subsequent semantic recognition step S300 and gender classification step S400, respectively. The speech recognition task requires that complete text information be preserved as far as possible, so its VAD boundary should be more tolerant; the gender classification task requires that all silence be removed as far as possible, so its VAD boundary should be stricter. Therefore, in the preprocessing step S200, two different voice segments are optionally provided separately to the semantic recognition step S300 and the gender classification step S400.
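As an illustrative sketch only (not the patent's implementation), a crude energy-based endpoint detector shows how two thresholds over one VAD pass can yield a tolerant segment for semantic recognition and a stricter segment for gender classification; the frame energies and threshold values below are invented for illustration:

```python
def detect_segment(frame_energies, threshold):
    """Return (start, end) frame indices spanning all frames whose energy
    reaches the threshold, or None if no frame does."""
    voiced = [i for i, e in enumerate(frame_energies) if e >= threshold]
    if not voiced:
        return None
    return (voiced[0], voiced[-1] + 1)

# Toy per-frame energies: quiet lead-in, speech, quiet tail.
energies = [0.01, 0.02, 0.30, 0.90, 0.85, 0.40, 0.05, 0.02]

# Tolerant boundary: keeps low-energy edges, preserving complete text for ASR.
first_segment = detect_segment(energies, 0.02)   # -> (1, 8)
# Strict boundary: trims near-silence frames, as the gender classifier prefers.
second_segment = detect_segment(energies, 0.30)  # -> (2, 6)
```

Note that the strict segment is contained within the tolerant one; a production system would use a trained VAD model rather than a fixed energy threshold.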
Next, the gender classification step S400 will be described.
Fig. 2 is a specific flowchart of the gender classification step S400.
As shown in Fig. 2, the gender classification step S400 can be roughly divided into a training phase and a recognition phase.
First, a training phase will be described.
A gender-labeled training set must be prepared as training samples, including wav.scp, utt2spk, text, and the gender corresponding to each utterance. The fbank features of the training set (that is, the filter-bank feature parameters of the audio files) are extracted, and the fbank features and CMVN statistics shown in Fig. 2 are prepared for training the long short-term memory model.
Because the gender model is a classification model, label files (the FA files in Fig. 2) corresponding to the features must be constructed. The FA files cover only the speech frames of the features; a batch of FA files reflecting the gender of each feature file is constructed according to the number of feature frames.
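A minimal sketch of this label construction, assuming male is encoded as 0 and female as 1 and that an FA file simply carries one label per feature frame (the actual FA format is not specified in the text):

```python
GENDER_LABELS = {"male": 0, "female": 1}  # assumed class encoding

def build_fa(utterance_gender, num_frames):
    """Build a per-frame label sequence: the utterance-level gender label
    is repeated once for every feature frame of the utterance."""
    label = GENDER_LABELS[utterance_gender]
    return [label] * num_frames

fa = build_fa("female", 4)  # -> [1, 1, 1, 1]
```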
The prepared feature files (feats) and label files (FA) are input into the long short-term memory model for training until convergence. Here, LSTM (Long Short-Term Memory) is a kind of recurrent neural network (RNN). An RNN is a special neural network that feeds its state back to itself along a time or symbol sequence; unrolled over the sequence it becomes an ordinary multi-layer feed-forward network, and RNNs are widely used in speech recognition.
Here, the basic parameters adopted by the long short-term memory model are:
num-lstm-layers: 1;
cell-dim: 1024;
lstm-delay: -1.
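These Kaldi-style option names can be collected into a configuration mapping; the comments give one plausible interpretation of each option (the reading of lstm-delay as a one-frame recurrence delay is an assumption):

```python
# LSTM configuration quoted in the text (Kaldi-style nnet option names).
lstm_config = {
    "num-lstm-layers": 1,   # a single LSTM layer
    "cell-dim": 1024,       # hidden/cell state dimension
    "lstm-delay": -1,       # assumed: each step looks one frame into the past
}
```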
Next, the recognition phase will be described.
First, feature extraction is performed. When the user speaks, the voice information is first detected using the endpoint detection (VAD) algorithm, and features are extracted from the non-silence speech frames detected by the VAD. Since the long short-term memory model depends on past frames, a buffer may be provided for feature accumulation.
Then, forward computation is performed. A feature matrix of a certain length is fed into the long short-term memory model, and the posterior probabilities of the gender classes are obtained through forward computation. A posterior probability is a probability revised after information about the outcome is obtained; it is the "effect" in a cause-and-effect question. The probability of an event that has not yet occurred is the prior probability; once the event has occurred, the probability that it was brought about by a particular cause is the posterior probability.
Finally, posterior processing is performed. A time threshold T is set through repeated experiments; the posterior probabilities accumulated over a duration T are compared, and the class with the larger accumulated probability is taken as the gender classification result for the input audio. The time threshold T may be, for example, 0.5 s or 1 s. T should not be set too long, since more data would be needed and recognition would no longer be real-time; nor too short, since the accuracy may then be insufficient.
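The posterior-processing rule can be sketched as follows; the two-class ordering (male, female) and a frame rate of 100 frames per second are illustrative assumptions:

```python
def classify_gender(frame_posteriors, t_seconds=0.5, frames_per_second=100):
    """Accumulate per-frame (p_male, p_female) posteriors over the first
    T seconds of frames and return the class with the larger sum."""
    needed = int(t_seconds * frames_per_second)
    window = frame_posteriors[:needed]
    acc_male = sum(p for p, _ in window)
    acc_female = sum(p for _, p in window)
    return "male" if acc_male > acc_female else "female"

# 60 frames leaning female followed by 40 leaning male: within the first
# 0.5 s (50 frames) the female evidence dominates.
posteriors = [(0.2, 0.8)] * 60 + [(0.7, 0.3)] * 40
result = classify_gender(posteriors)  # -> "female"
```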
In this way, on the one hand the voice segment is semantically recognized and semantic information is output through the semantic recognition step S300; on the other hand the voice segment is gender-classified, the user's gender is recognized, and gender information is output through the gender classification step S400. The recognized gender information and semantic information are then fused in the fusion processing step S500 to obtain a personalized reply to the input voice information. In some examples of the application, the "fusion" mentioned in step S500 may be understood as taking into account, when generating the reply, the gender information obtained in step S400, for example to make the reply more targeted or more appropriate, as in the examples given below; other uses of the gender information from step S400 are not excluded.
For example, when the user's input is "Good morning!", the system outputs "Good morning, sir!" if the gender classification step S400 recognizes a male, and "Good morning, madam!" if it recognizes a female. When the input is "Do you think I look good?", the system outputs "Of course, you are a handsome guy!" for a male and "Of course, you are a great beauty!" for a female. When the input is "What time is it?", the system outputs "Sir, it is now 3 p.m." for a male and "Madam, it is now 3 p.m." for a female.
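A minimal sketch of such a fusion step as a lookup table; the intent names and reply templates are hypothetical, loosely following the examples above:

```python
# Hypothetical (intent, gender) -> reply table for the fusion step.
REPLIES = {
    ("greeting", "male"): "Good morning, sir!",
    ("greeting", "female"): "Good morning, madam!",
    ("ask_time", "male"): "Sir, it is now 3 p.m.",
    ("ask_time", "female"): "Madam, it is now 3 p.m.",
}

def fuse(intent, gender):
    """Fuse semantic information (an intent) with gender information to
    produce a personalized reply; fall back when the pair is unknown."""
    return REPLIES.get((intent, gender), "Sorry, I did not catch that.")

reply = fuse("greeting", "female")  # -> "Good morning, madam!"
```

A real fusion module would of course combine the gender signal with a full dialogue manager rather than a static table.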
The embodiments of the voice interaction method of the present application are described above. Next, the voice interaction system of the present application will be described.
fig. 3 is a block diagram showing the construction of a voice interaction system according to an embodiment of the present application.
As shown in fig. 3, a voice interaction system according to an embodiment of the present application includes:
an input module 100 for inputting voice information;
the preprocessing module 200 is used for receiving and preprocessing voice information and outputting voice segments;
the gender classification module 300 is used for performing gender classification on the voice segment output by the preprocessing module, identifying the gender of the user and outputting gender information;
the semantic recognition module 400 is used for carrying out semantic recognition on the voice segment output by the preprocessing module and outputting semantic information;
the fusion processing module 500 is configured to fuse the gender information and the semantic information to obtain personalized reply information for the voice information; and
and the output module 600, for outputting the personalized reply information, for example as speech.
The preprocessing module 200 performs voice segment detection on the input voice information using an endpoint detection (VAD) algorithm. Specifically, the preprocessing module 200 detects voice segments in the input voice information and outputs a first voice segment provided to the semantic recognition module 400 and a second voice segment provided to the gender classification module 300. Because the gender classification module requires that all silence segments be removed as far as possible, the VAD boundary for the second segment should be stricter; because the semantic recognition module 400 requires that complete text information be preserved as far as possible, the VAD boundary for the first segment should be more tolerant. Therefore, the endpoint detection boundary of the second voice segment is stricter than that of the first voice segment.
Wherein the gender classification module 300 comprises:
the model training sub-module 310, configured to train a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information; and
the gender classification sub-module 320, configured to input the voice segment into the trained LSTM model and output a gender classification.
The model training sub-module 310 extracts the filter-bank output acoustic features of a gender-labeled training set, constructs a label file FA corresponding to those features, and inputs the features and the label file into the LSTM model for training until the model converges. The gender classification sub-module 320 inputs the voice segments into the trained LSTM model, obtains the posterior probabilities of the gender classes through forward computation, and accumulates the posterior probabilities over a prescribed period to obtain the gender classification result.
The voice interaction method described in any of the above examples can be applied to a vehicle, or the voice interaction system described in any of the above examples can be applied to a vehicle. For example, as part of a vehicle control method or vehicle control system.
The present application also provides a voice interaction device capable of performing the voice interaction method as described in any of the examples above; alternatively, it comprises a voice interaction system as described in any of the examples above. The voice interaction device can be implemented separately as a component which can be provided in a vehicle, for example, so that a person in the vehicle can interact with it in voice. The voice interaction device may be a device fixed to the vehicle or a device capable of being taken from/put back into the vehicle. And further, in some examples, the voice interaction device is capable of communicating with an electronic control system within the vehicle. In some cases, the voice interaction device may also be implemented in an existing electronic component of the vehicle, such as an infotainment system of the vehicle, etc.
The application also provides a controller which comprises a storage component, a processing component and an instruction which is stored on the storage component and can be operated by the processing component, and is characterized in that the processing component realizes the voice interaction method when the instruction is operated.
According to the voice interaction method and voice interaction system of the above examples, by combining semantic analysis with gender classification, replies can be differentiated according to the user's gender, improving the user experience and making voice interaction more intelligent.
The above examples mainly illustrate the voice interaction method and the voice interaction system of the present application. Although only a few specific embodiments of the present application have been described, those skilled in the art will appreciate that the present application may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and the application is intended to cover various modifications and substitutions without departing from the spirit and scope of the application as defined by the appended claims.
Claims (12)
1. A method of voice interaction, comprising:
a preprocessing step of preprocessing the input voice information and outputting a first voice segment and a second voice segment;
a semantic recognition step, namely performing semantic recognition on the first voice segment output by the preprocessing step and outputting semantic information;
a gender classification step, namely recognizing the user's gender from the second voice segment output by the preprocessing step and outputting gender information; and
a fusion processing step of fusing the sex information and the semantic information to obtain personalized reply information to the voice information,
wherein in the preprocessing step, for the input voice information, detection of a voice segment is performed using an end point detection algorithm and the first voice segment supplied to the semantic recognition step and the second voice segment supplied to the gender classification step are output,
wherein the first speech segment is different from the second speech segment, wherein the first speech segment is provided such that complete text information is preserved and the second speech segment is provided such that all silence is rejected, and wherein the end point detection boundary of the second speech segment is more stringent than the end point detection boundary of the first speech segment.
2. The voice interaction method of claim 1, wherein the gender classification step comprises:
a model training sub-step of training a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information; and
a gender classification sub-step of inputting the voice segment into the trained LSTM model and outputting a gender classification.
3. The voice interaction method of claim 2, wherein the model training substep comprises:
preparing a gender-labeled training set;
extracting the filter-bank output acoustic features of the training set;
constructing a label file corresponding to the filter-bank features; and
inputting the filter-bank features and the label file into the LSTM model for training until the model converges.
4. The voice interaction method of claim 2, wherein the gender classification sub-step comprises:
inputting the voice segment into the trained long short-term memory model;
performing forward computation to obtain posterior probabilities of the gender classes; and
accumulating the posterior probabilities over a predetermined period to obtain the gender classification result.
5. A voice interaction system, comprising:
a preprocessing module for preprocessing the input voice information and outputting a first voice segment and a second voice segment;
a semantic recognition module for performing semantic recognition on the first voice segment output by the preprocessing module and outputting semantic information;
a gender classification module for performing gender classification on the second voice segment output by the preprocessing module, identifying the gender of the user, and outputting gender information; and
a fusion processing module for fusing the gender information and the semantic information to obtain personalized reply information for the voice information,
wherein, in the preprocessing module, speech segments are detected from the input voice information using an endpoint detection algorithm, and the first voice segment provided to the semantic recognition module and the second voice segment provided to the gender classification module are output,
wherein the first voice segment is different from the second voice segment, the first voice segment being delimited so that the complete text information is preserved and the second voice segment being delimited so that all silence is rejected, and wherein the endpoint detection boundary of the second voice segment is more stringent than that of the first voice segment.
6. The voice interaction system of claim 5, wherein the gender classification module comprises:
a model training sub-module for training a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information to obtain a trained long short-term memory model; and
a gender classification sub-module for inputting the voice segment into the trained long short-term memory model and outputting a gender classification.
7. The voice interaction system of claim 6, wherein
the model training sub-module, based on a training set with gender labels, extracts the filter-bank output acoustic features of the training set, constructs a label file corresponding to those features, and inputs the features and the label file into the long short-term memory model for training until the model converges.
8. The voice interaction system of claim 5, wherein the gender classification sub-module inputs the voice segment into the trained long short-term memory model, obtains posterior probabilities of the gender classes through forward computation, and accumulates the posterior probabilities over a predetermined period to obtain the gender classification result.
9. The voice interaction method of any one of claims 1 to 4, or the voice interaction system of any one of claims 5 to 8, for use in a vehicle.
10. A voice interaction device capable of performing the voice interaction method of any one of claims 1 to 4 or comprising the voice interaction system of any one of claims 5 to 8.
11. The voice interaction device of claim 10, disposed on a vehicle.
12. A controller comprising a storage means, a processing means and instructions stored on the storage means and executable by the processing means, wherein the processing means implements the voice interaction method of any one of claims 1 to 4 when the instructions are executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473045.9A CN110503943B (en) | 2018-05-17 | 2018-05-17 | Voice interaction method and voice interaction system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110503943A (en) | 2019-11-26 |
CN110503943B (en) | 2023-09-19 |
Family
ID=68583957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810473045.9A Active CN110503943B (en) | 2018-05-17 | 2018-05-17 | Voice interaction method and voice interaction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503943B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883133B (en) * | 2020-07-20 | 2023-08-29 | 深圳乐信软件技术有限公司 | Customer service voice recognition method, customer service voice recognition device, server and storage medium |
CN112397067A (en) * | 2020-11-13 | 2021-02-23 | 重庆长安工业(集团)有限责任公司 | Voice control terminal of weapon equipment |
CN113870861A (en) * | 2021-09-10 | 2021-12-31 | Oppo广东移动通信有限公司 | Voice interaction method and device, storage medium and terminal |
CN116092056B (en) * | 2023-03-06 | 2023-07-07 | 安徽蔚来智驾科技有限公司 | Target recognition method, vehicle control method, device, medium and vehicle |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105700682A (en) * | 2016-01-08 | 2016-06-22 | 北京乐驾科技有限公司 | Intelligent gender and emotion recognition detection system and method based on vision and voice |
CN107146615A (en) * | 2017-05-16 | 2017-09-08 | 南京理工大学 | Audio recognition method and system based on the secondary identification of Matching Model |
CN107305541A (en) * | 2016-04-20 | 2017-10-31 | 科大讯飞股份有限公司 | Speech recognition text segmentation method and device |
CN107799126A (en) * | 2017-10-16 | 2018-03-13 | 深圳狗尾草智能科技有限公司 | Sound end detecting method and device based on Supervised machine learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103871401B (en) * | 2012-12-10 | 2016-12-28 | 联想(北京)有限公司 | A kind of method of speech recognition and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | | Effective date of registration: 20200806. Address after: Susong Road West and Shenzhen Road North, Hefei Economic and Technological Development Zone, Anhui Province. Applicant after: Weilai (Anhui) Holding Co.,Ltd. Address before: 30 Floor of Yihe Building, No. 1 Kangle Plaza, Central, Hong Kong, China. Applicant before: NIO NEXTEV Ltd.
GR01 | Patent grant | |