CN110503943B - Voice interaction method and voice interaction system - Google Patents
Voice interaction method and voice interaction system Download PDFInfo
- Publication number
- CN110503943B (publication) · CN201810473045.9A / CN201810473045A (application)
- Authority
- CN
- China
- Prior art keywords
- voice
- information
- gender
- segment
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000003993 interaction Effects 0.000 title claims abstract description 59
- 238000000034 method Methods 0.000 title claims abstract description 30
- 238000007781 pre-processing Methods 0.000 claims abstract description 35
- 238000007499 fusion processing Methods 0.000 claims abstract description 9
- 238000012549 training Methods 0.000 claims description 47
- 230000015654 memory Effects 0.000 claims description 32
- 238000001514 detection method Methods 0.000 claims description 25
- 238000002372 labelling Methods 0.000 claims description 11
- 238000012545 processing Methods 0.000 claims description 10
- 238000004364 calculation method Methods 0.000 claims description 6
- 230000002452 interceptive effect Effects 0.000 claims description 5
- 239000000284 extract Substances 0.000 claims description 3
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000000605 extraction Methods 0.000 description 3
- 230000000306 recurrent effect Effects 0.000 description 3
- 238000010276 construction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000007935 neutral effect Effects 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 238000013145 classification model Methods 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 230000007787 long-term memory Effects 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The application relates to a voice interaction method and a voice interaction system. The method comprises the following steps: a preprocessing step of preprocessing input voice information and outputting a voice segment; a semantic recognition step of performing semantic recognition on the voice segment output by the preprocessing step and outputting semantic information; a gender classification step of recognizing the user's gender from the voice segment output by the preprocessing step and outputting gender information; and a fusion processing step of fusing the gender information and the semantic information to obtain a personalized reply to the voice information. With the voice interaction method and voice interaction system of the application, replies can be differentiated according to the user's gender, improving the user experience and making voice interaction more intelligent.
Description
Technical Field
The present application relates to a voice recognition technology, and more particularly, to a voice interaction method and a voice interaction system capable of recognizing gender of a user.
Background
In vehicle-mounted dialogue systems, existing voice recognition technology can recognize a user's speech to a certain extent, but some topics depend on the user's gender, and existing technology often has difficulty giving a gender-appropriate answer from the recognized text alone.
The information disclosed in the background section of the application is only for enhancement of understanding of the general background of the application and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present application aims to provide a voice interaction method and a voice interaction system capable of recognizing the gender of a user.
The voice interaction method of the application is characterized by comprising the following steps:
a preprocessing step of preprocessing the input voice information and outputting a voice segment;
a semantic recognition step, namely performing semantic recognition on the voice segment output by the preprocessing step and outputting semantic information;
a gender classification step, namely recognizing the gender of the user from the voice segment output by the preprocessing step and outputting gender information; and
and a fusion processing step of fusing the gender information and the semantic information to obtain personalized reply information for the voice information.
Optionally, the gender classification step includes:
a model training sub-step of training a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information; and
a gender classification sub-step of inputting the voice segment into the trained LSTM model and outputting a gender classification.
Optionally, in the preprocessing step, an endpoint detection algorithm is used to detect a speech segment for the input speech information.
Optionally, in the preprocessing step, for the input voice information, an endpoint detection algorithm is used to detect voice segments, outputting a first voice segment provided to the semantic recognition step and a second voice segment provided to the gender classification step, wherein the endpoint detection boundary of the second voice segment is stricter than that of the first voice segment.
Optionally, the model training sub-step comprises:
preparing a gender-labeled training set;
extracting the filter-bank output acoustic features of the training set;
constructing a label file corresponding to the filter-bank features; and
inputting the filter-bank features and the label file into the LSTM model for training until the model converges.
Optionally, the gender classification sub-step includes:
inputting the voice segment into the trained LSTM model;
performing forward computation to obtain posterior probabilities of the gender classes; and
accumulating the posterior probabilities over a predetermined period to obtain the gender classification result.
The voice interaction system of the present application is characterized by comprising:
the preprocessing module is used for preprocessing the input voice information and outputting voice segments;
the semantic recognition module is used for carrying out semantic recognition on the voice segment output by the preprocessing module and outputting semantic information;
the gender classification module is used for classifying the gender of the voice segment output by the preprocessing module, identifying the gender of the user and outputting gender information; and
and the fusion processing module is used for fusing the gender information and the semantic information to obtain personalized reply information for the voice information.
Optionally, the gender classification module includes:
the model training sub-module is used for training a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information; and
the gender classification sub-module is used for inputting the voice segment into the trained LSTM model and outputting a gender classification.
Optionally, in the preprocessing module, for the input voice information, an endpoint detection algorithm is used to detect a voice segment.
Optionally, the preprocessing module performs voice segment detection on the input voice information using an endpoint detection algorithm and outputs a first voice segment provided to the semantic recognition module and a second voice segment provided to the gender classification module,
wherein the end-point detection boundary of the second speech segment is more stringent than the end-point detection boundary of the first speech segment.
Optionally, the model training sub-module extracts the filter-bank output acoustic features of a gender-labeled training set, constructs a label file corresponding to those features, and inputs the features and the label file into the LSTM model for training until the model converges.
Optionally, the gender classification sub-module inputs the voice segment into the trained LSTM model, obtains the posterior probabilities of the gender classes through forward computation, and accumulates the posterior probabilities over a specified period to obtain the gender classification result.
The voice interaction method of the application is applied to a vehicle, or the voice interaction system of the application is applied to a vehicle.
The application also provides voice interaction equipment which can execute the voice interaction method or comprises the voice interaction system.
Optionally, the voice interaction device is disposed on a vehicle.
The application provides a controller comprising a storage component, a processing component, and instructions stored on the storage component and executable by the processing component, wherein the processing component implements the above voice interaction method when executing the instructions. According to the voice interaction method and voice interaction system of the application, by combining semantic analysis with gender classification, replies can be differentiated according to the user's gender, improving the user experience and making voice interaction more intelligent.
Other features and advantages of the methods and apparatus of the present application will become apparent from the accompanying drawings and the following detailed description, which together serve to illustrate certain principles of the application.
Drawings
Fig. 1 is a flowchart showing a voice interaction method according to an embodiment of the present application.
Fig. 2 is a schematic illustration of a specific flow of the gender classification step.
Fig. 3 is a block diagram showing the construction of a voice interaction system according to an embodiment of the present application.
Detailed Description
The following presents a simplified summary of the application in order to provide a basic understanding of the application. It is not intended to identify key or critical elements of the application or to delineate the scope of the application.
First, some terms that will appear hereinafter will be explained.
NLU: natural language understanding;
ASR: automatic speech recognition;
LSTM (long short-term memory) model: a deep-learning recurrent model capable of learning long-range dependencies;
fbank features: filter-bank feature parameters of an audio file;
CMVN: cepstral mean and variance normalization statistics of the feature files;
GMM-HMM: a conventional acoustic model, namely a hidden Markov model with Gaussian-mixture observation densities.
Fig. 1 is a flowchart of a voice interaction method according to an embodiment of the present application.
Referring to fig. 1, the voice interaction method according to an embodiment of the present application includes the following steps:
input step S100: inputting voice information;
preprocessing step S200: preprocessing the voice information input in the input step S100 and outputting voice segments;
semantic recognition step S300: carrying out semantic recognition on the voice segment output by the preprocessing step S200 and outputting semantic information;
gender classification step S400: performing gender classification on the voice segment output by the preprocessing step S200, identifying the gender of the user and outputting gender information;
fusion processing step S500: fusing the gender information and the semantic information to obtain personalized reply information for the input voice information; and
output step S600: outputting the personalized reply information, for example as speech or as text.
Next, the preprocessing step S200, the gender classification step S400, and the fusion processing step S500 are explained by way of example. In the semantic recognition step S300, semantic recognition of the voice segment and output of semantic information may be performed with the same technical means as in conventional systems, so its description is omitted.
As an example, in the preprocessing step S200, an endpoint detection (voice activity detection, VAD) algorithm is applied to the input voice information to obtain voice segments. For example, the user's voice information is fed into a VAD model, which obtains the voice segments through endpoint detection, feature extraction, and similar processing. The obtained voice segments are provided to the subsequent semantic recognition step S300 and gender classification step S400, respectively. The speech recognition task requires that complete text information be preserved as far as possible, so its VAD boundary should be more tolerant; the gender classification task requires that all silence be removed as far as possible, so its VAD boundary should be stricter. Therefore, in the preprocessing step S200, two different voice segments are optionally provided separately to the semantic recognition step S300 and the gender classification step S400.
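As an illustrative sketch only (not the patent's implementation), a crude energy-based endpoint detector shows how two thresholds over one VAD pass can yield a tolerant segment for semantic recognition and a stricter segment for gender classification; the frame energies and threshold values below are invented for illustration:

```python
def detect_segment(frame_energies, threshold):
    """Return (start, end) frame indices spanning all frames whose energy
    reaches the threshold, or None if no frame does."""
    voiced = [i for i, e in enumerate(frame_energies) if e >= threshold]
    if not voiced:
        return None
    return (voiced[0], voiced[-1] + 1)

# Toy per-frame energies: quiet lead-in, speech, quiet tail.
energies = [0.01, 0.02, 0.30, 0.90, 0.85, 0.40, 0.05, 0.02]

# Tolerant boundary: keeps low-energy edges, preserving complete text for ASR.
first_segment = detect_segment(energies, 0.02)   # -> (1, 8)
# Strict boundary: trims near-silence frames, as the gender classifier prefers.
second_segment = detect_segment(energies, 0.30)  # -> (2, 6)
```

Note that the strict segment is contained within the tolerant one; a production system would use a trained VAD model rather than a fixed energy threshold.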
Next, the gender classification step S400 will be described.
Fig. 2 is a specific flowchart of the gender classification step S400.
As shown in Fig. 2, the gender classification step S400 can be roughly divided into a training phase and a recognition phase.
First, a training phase will be described.
A gender-labeled training set must be prepared as training samples, including wav.scp, utt2spk, text, and the gender corresponding to each utterance. The fbank features of the training set (that is, the filter-bank feature parameters of the audio files) are extracted, and the fbank features and CMVN statistics shown in Fig. 2 are prepared for training the long short-term memory model.
Because the gender model is a classification model, label files (the FA files in Fig. 2) corresponding to the features must be constructed. The FA files cover only the speech frames of the features; a batch of FA files reflecting the gender of each feature file is constructed according to the number of feature frames.
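A minimal sketch of this label construction, assuming male is encoded as 0 and female as 1 and that an FA file simply carries one label per feature frame (the actual FA format is not specified in the text):

```python
GENDER_LABELS = {"male": 0, "female": 1}  # assumed class encoding

def build_fa(utterance_gender, num_frames):
    """Build a per-frame label sequence: the utterance-level gender label
    is repeated once for every feature frame of the utterance."""
    label = GENDER_LABELS[utterance_gender]
    return [label] * num_frames

fa = build_fa("female", 4)  # -> [1, 1, 1, 1]
```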
The prepared feature files (feats) and label files (FA) are input into the long short-term memory model for training until convergence. Here, LSTM (Long Short-Term Memory) is a kind of recurrent neural network (RNN). An RNN is a special neural network that feeds its state back to itself along a time or symbol sequence; unrolled over the sequence it becomes an ordinary multi-layer feed-forward network, and RNNs are widely used in speech recognition.
Here, the basic parameters adopted by the long short-term memory model are:
num-lstm-layers: 1;
cell-dim: 1024;
lstm-delay: -1.
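These Kaldi-style option names can be collected into a configuration mapping; the comments give one plausible interpretation of each option (the reading of lstm-delay as a one-frame recurrence delay is an assumption):

```python
# LSTM configuration quoted in the text (Kaldi-style nnet option names).
lstm_config = {
    "num-lstm-layers": 1,   # a single LSTM layer
    "cell-dim": 1024,       # hidden/cell state dimension
    "lstm-delay": -1,       # assumed: each step looks one frame into the past
}
```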
Next, the recognition phase will be described.
First, feature extraction is performed. When the user speaks, the voice information is first detected using the endpoint detection (VAD) algorithm, and features are extracted from the non-silence speech frames detected by the VAD. Since the long short-term memory model depends on past frames, a buffer may be provided for feature accumulation.
Then, forward computation is performed. A feature matrix of a certain length is fed into the long short-term memory model, and the posterior probabilities of the gender classes are obtained through forward computation. A posterior probability is a probability revised after information about the outcome is obtained; it is the "effect" in a cause-and-effect question. The probability of an event that has not yet occurred is the prior probability; once the event has occurred, the probability that it was brought about by a particular cause is the posterior probability.
Finally, posterior processing is performed. A time threshold T is set through repeated experiments; the posterior probabilities accumulated over a duration T are compared, and the class with the larger accumulated probability is taken as the gender classification result for the input audio. The time threshold T may be, for example, 0.5 s or 1 s. T should not be set too long, since more data would be needed and recognition would no longer be real-time; nor too short, since the accuracy may then be insufficient.
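The posterior-processing rule can be sketched as follows; the two-class ordering (male, female) and a frame rate of 100 frames per second are illustrative assumptions:

```python
def classify_gender(frame_posteriors, t_seconds=0.5, frames_per_second=100):
    """Accumulate per-frame (p_male, p_female) posteriors over the first
    T seconds of frames and return the class with the larger sum."""
    needed = int(t_seconds * frames_per_second)
    window = frame_posteriors[:needed]
    acc_male = sum(p for p, _ in window)
    acc_female = sum(p for _, p in window)
    return "male" if acc_male > acc_female else "female"

# 60 frames leaning female followed by 40 leaning male: within the first
# 0.5 s (50 frames) the female evidence dominates.
posteriors = [(0.2, 0.8)] * 60 + [(0.7, 0.3)] * 40
result = classify_gender(posteriors)  # -> "female"
```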
In this way, on the one hand the voice segment is semantically recognized and semantic information is output through the semantic recognition step S300; on the other hand the voice segment is gender-classified, the user's gender is recognized, and gender information is output through the gender classification step S400. The recognized gender information and semantic information are then fused in the fusion processing step S500 to obtain a personalized reply to the input voice information. In some examples of the application, the "fusion" mentioned in step S500 may be understood as taking into account, when generating the reply, the gender information obtained in step S400, for example to make the reply more targeted or more appropriate, as in the examples given below; other uses of the gender information from step S400 are not excluded.
For example, when the user's input is "Good morning!", the system outputs "Good morning, sir!" if the gender classification step S400 recognizes a male, and "Good morning, madam!" if it recognizes a female. When the input is "Do you think I look good?", the system outputs "Of course, you are a handsome guy!" for a male and "Of course, you are a great beauty!" for a female. When the input is "What time is it?", the system outputs "Sir, it is now 3 p.m." for a male and "Madam, it is now 3 p.m." for a female.
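A minimal sketch of such a fusion step as a lookup table; the intent names and reply templates are hypothetical, loosely following the examples above:

```python
# Hypothetical (intent, gender) -> reply table for the fusion step.
REPLIES = {
    ("greeting", "male"): "Good morning, sir!",
    ("greeting", "female"): "Good morning, madam!",
    ("ask_time", "male"): "Sir, it is now 3 p.m.",
    ("ask_time", "female"): "Madam, it is now 3 p.m.",
}

def fuse(intent, gender):
    """Fuse semantic information (an intent) with gender information to
    produce a personalized reply; fall back when the pair is unknown."""
    return REPLIES.get((intent, gender), "Sorry, I did not catch that.")

reply = fuse("greeting", "female")  # -> "Good morning, madam!"
```

A real fusion module would of course combine the gender signal with a full dialogue manager rather than a static table.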
The embodiments of the voice interaction method of the present application are described above. Next, the voice interaction system of the present application will be described.
fig. 3 is a block diagram showing the construction of a voice interaction system according to an embodiment of the present application.
As shown in fig. 3, a voice interaction system according to an embodiment of the present application includes:
an input module 100 for inputting voice information;
the preprocessing module 200 is used for receiving and preprocessing voice information and outputting voice segments;
the gender classification module 300 is used for performing gender classification on the voice segment output by the preprocessing module, identifying the gender of the user and outputting gender information;
the semantic recognition module 400 is used for carrying out semantic recognition on the voice segment output by the preprocessing module and outputting semantic information;
the fusion processing module 500 is configured to fuse the gender information and the semantic information to obtain personalized reply information for the voice information; and
and the output module 600, for outputting the personalized reply information, for example as speech.
The preprocessing module 200 performs voice segment detection on the input voice information using an endpoint detection (VAD) algorithm. Specifically, the preprocessing module 200 detects voice segments in the input voice information and outputs a first voice segment provided to the semantic recognition module 400 and a second voice segment provided to the gender classification module 300. Because the gender classification module requires that all silence segments be removed as far as possible, the VAD boundary for the second segment should be stricter; because the semantic recognition module 400 requires that complete text information be preserved as far as possible, the VAD boundary for the first segment should be more tolerant. Therefore, the endpoint detection boundary of the second voice segment is stricter than that of the first voice segment.
Wherein the gender classification module 300 comprises:
the model training sub-module 310, configured to train a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information; and
the gender classification sub-module 320, configured to input the voice segment into the trained LSTM model and output a gender classification.
The model training sub-module 310 extracts the filter-bank output acoustic features of a gender-labeled training set, constructs a label file FA corresponding to those features, and inputs the features and the label file into the LSTM model for training until the model converges. The gender classification sub-module 320 inputs the voice segments into the trained LSTM model, obtains the posterior probabilities of the gender classes through forward computation, and accumulates the posterior probabilities over a prescribed period to obtain the gender classification result.
The voice interaction method described in any of the above examples can be applied to a vehicle, or the voice interaction system described in any of the above examples can be applied to a vehicle. For example, as part of a vehicle control method or vehicle control system.
The present application also provides a voice interaction device capable of performing the voice interaction method as described in any of the examples above; alternatively, it comprises a voice interaction system as described in any of the examples above. The voice interaction device can be implemented separately as a component which can be provided in a vehicle, for example, so that a person in the vehicle can interact with it in voice. The voice interaction device may be a device fixed to the vehicle or a device capable of being taken from/put back into the vehicle. And further, in some examples, the voice interaction device is capable of communicating with an electronic control system within the vehicle. In some cases, the voice interaction device may also be implemented in an existing electronic component of the vehicle, such as an infotainment system of the vehicle, etc.
The application also provides a controller which comprises a storage component, a processing component and an instruction which is stored on the storage component and can be operated by the processing component, and is characterized in that the processing component realizes the voice interaction method when the instruction is operated.
According to the voice interaction method and voice interaction system of the above examples, by combining semantic analysis with gender classification, replies can be differentiated according to the user's gender, improving the user experience and making voice interaction more intelligent.
The above examples mainly illustrate the voice interaction method and the voice interaction system of the present application. Although only a few specific embodiments of the present application have been described, those skilled in the art will appreciate that the present application may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and the application is intended to cover various modifications and substitutions without departing from the spirit and scope of the application as defined by the appended claims.
Claims (12)
1. A method of voice interaction, comprising:
a preprocessing step of preprocessing the input voice information and outputting a first voice segment and a second voice segment;
a semantic recognition step, namely performing semantic recognition on the first voice segment output by the preprocessing step and outputting semantic information;
a gender classification step, namely recognizing the user's gender from the second voice segment output by the preprocessing step and outputting gender information; and
a fusion processing step of fusing the sex information and the semantic information to obtain personalized reply information to the voice information,
wherein in the preprocessing step, for the input voice information, detection of a voice segment is performed using an end point detection algorithm and the first voice segment supplied to the semantic recognition step and the second voice segment supplied to the gender classification step are output,
wherein the first speech segment is different from the second speech segment, wherein the first speech segment is provided such that complete text information is preserved and the second speech segment is provided such that all silence is rejected, and wherein the end point detection boundary of the second speech segment is more stringent than the end point detection boundary of the first speech segment.
2. The voice interaction method of claim 1, wherein the gender classification step comprises:
a model training sub-step of training a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information; and
a gender classification sub-step of inputting the voice segment into the trained LSTM model and outputting a gender classification.
3. The voice interaction method of claim 2, wherein the model training substep comprises:
preparing a gender-labeled training set;
extracting the filter-bank output acoustic features of the training set;
constructing a label file corresponding to the filter-bank features; and
inputting the filter-bank features and the label file into the LSTM model for training until the model converges.
4. The voice interaction method of claim 2, wherein the gender classification sub-step comprises:
inputting the voice segment into the trained long short-term memory model;
performing forward computation to obtain posterior probabilities of the gender classes; and
accumulating the posterior probabilities over a predetermined period to obtain the gender classification result.
5. A voice interaction system, comprising:
a preprocessing module for preprocessing the input voice information and outputting a first voice segment and a second voice segment;
a semantic recognition module for performing semantic recognition on the first voice segment output by the preprocessing module and outputting semantic information;
a gender classification module for performing gender classification on the second voice segment output by the preprocessing module, identifying the gender of the user, and outputting gender information; and
a fusion processing module for fusing the gender information and the semantic information to obtain personalized reply information for the voice information,
wherein, in the preprocessing module, speech segments are detected from the input voice information using an endpoint detection algorithm, and the first voice segment provided to the semantic recognition module and the second voice segment provided to the gender classification module are output,
wherein the first voice segment is different from the second voice segment, the first voice segment being delimited so that the complete text information is preserved and the second voice segment being delimited so that all silence is rejected, and wherein the endpoint detection boundary of the second voice segment is more stringent than that of the first voice segment.
6. The voice interaction system of claim 5, wherein the gender classification module comprises:
a model training sub-module for training a long short-term memory (LSTM) model based on filter-bank output acoustic features and pre-labeled gender information to obtain a trained long short-term memory model; and
a gender classification sub-module for inputting the voice segment into the trained long short-term memory model and outputting a gender classification.
7. The voice interaction system of claim 6, wherein
the model training sub-module, based on a training set with gender labels, extracts the filter-bank output acoustic features of the training set, constructs a label file corresponding to those features, and inputs the features and the label file into the long short-term memory model for training until the model converges.
8. The voice interaction system of claim 5, wherein the gender classification sub-module inputs the voice segment into the trained long short-term memory model, obtains posterior probabilities of the gender classes through forward computation, and accumulates the posterior probabilities over a predetermined period to obtain the gender classification result.
9. The voice interaction method of any one of claims 1 to 4, or the voice interaction system of any one of claims 5 to 8, for use in a vehicle.
10. A voice interaction device capable of performing the voice interaction method of any one of claims 1 to 4 or comprising the voice interaction system of any one of claims 5 to 8.
11. The voice interaction device of claim 10, disposed on a vehicle.
12. A controller comprising a storage means, a processing means and instructions stored on the storage means and executable by the processing means, wherein the processing means implements the voice interaction method of any one of claims 1 to 4 when the instructions are executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473045.9A CN110503943B (en) | 2018-05-17 | 2018-05-17 | Voice interaction method and voice interaction system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110503943A (en) | 2019-11-26 |
CN110503943B (en) | 2023-09-19 |
Family
ID=68583957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810473045.9A Active CN110503943B (en) | 2018-05-17 | 2018-05-17 | Voice interaction method and voice interaction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503943B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883133B (en) * | 2020-07-20 | 2023-08-29 | 深圳乐信软件技术有限公司 | Customer service voice recognition method, customer service voice recognition device, server and storage medium |
CN112397067A (en) * | 2020-11-13 | 2021-02-23 | 重庆长安工业(集团)有限责任公司 | Voice control terminal of weapon equipment |
CN113870861A (en) * | 2021-09-10 | 2021-12-31 | Oppo广东移动通信有限公司 | Voice interaction method and device, storage medium and terminal |
CN116092056B (en) * | 2023-03-06 | 2023-07-07 | 安徽蔚来智驾科技有限公司 | Target recognition method, vehicle control method, device, medium and vehicle |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105700682A (en) * | 2016-01-08 | 2016-06-22 | 北京乐驾科技有限公司 | Intelligent gender and emotion recognition detection system and method based on vision and voice |
CN107146615A (en) * | 2017-05-16 | 2017-09-08 | 南京理工大学 | Audio recognition method and system based on the secondary identification of Matching Model |
CN107305541A (en) * | 2016-04-20 | 2017-10-31 | 科大讯飞股份有限公司 | Speech recognition text segmentation method and device |
CN107799126A (en) * | 2017-10-16 | 2018-03-13 | 深圳狗尾草智能科技有限公司 | Sound end detecting method and device based on Supervised machine learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103871401B (en) * | 2012-12-10 | 2016-12-28 | 联想(北京)有限公司 | A kind of method of speech recognition and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | | Effective date of registration: 20200806. Address after: Susong Road West and Shenzhen Road North, Hefei Economic and Technological Development Zone, Anhui Province. Applicant after: Weilai (Anhui) Holding Co.,Ltd. Address before: 30 Floor of Yihe Building, No. 1 Kangle Plaza, Central, Hong Kong, China. Applicant before: NIO NEXTEV Ltd.
GR01 | Patent grant | |