CN110503943B - Voice interaction method and voice interaction system - Google Patents

Voice interaction method and voice interaction system

Info

Publication number
CN110503943B
CN110503943B (application CN201810473045.9A / CN201810473045A)
Authority
CN
China
Prior art keywords
voice
information
gender
segment
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810473045.9A
Other languages
Chinese (zh)
Other versions
CN110503943A (en)
Inventor
孙珏
徐曼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NIO Holding Co Ltd
Original Assignee
NIO Anhui Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NIO Anhui Holding Co Ltd filed Critical NIO Anhui Holding Co Ltd
Priority to CN201810473045.9A priority Critical patent/CN110503943B/en
Publication of CN110503943A publication Critical patent/CN110503943A/en
Application granted granted Critical
Publication of CN110503943B publication Critical patent/CN110503943B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/1822 - Parsing for meaning understanding
    • G10L17/00 - Speaker identification or verification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to a voice interaction method and a voice interaction system. The method comprises the following steps: a preprocessing step of preprocessing the input voice information and outputting a voice segment; a semantic recognition step of performing semantic recognition on the voice segment output by the preprocessing step and outputting semantic information; a gender classification step of recognizing the gender of the user from the voice segment output by the preprocessing step and outputting gender information; and a fusion processing step of fusing the gender information and the semantic information to obtain personalized reply information for the voice information. With the voice interaction method and the voice interaction system, replies can be differentiated according to the gender of the user, which improves the user experience and makes voice interaction more intelligent.

Description

Voice interaction method and voice interaction system
Technical Field
The present application relates to voice recognition technology, and more particularly to a voice interaction method and a voice interaction system capable of recognizing the gender of a user.
Background
In a vehicle-mounted dialogue system, existing voice recognition technology can recognize a user's speech to a certain extent. However, some topics relate to the gender of the user, and existing voice recognition technology often has difficulty giving an answer that matches the user's gender based only on the recognized text.
The information disclosed in the background section of the application is only for enhancement of understanding of the general background of the application and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present application aims to provide a voice interaction method and a voice interaction system capable of recognizing the gender of a user.
The voice interaction method of the application is characterized by comprising the following steps:
a preprocessing step, namely preprocessing the input voice information and outputting a voice segment;
a semantic recognition step, namely performing semantic recognition on the voice segment output by the preprocessing step and outputting semantic information;
a gender classification step, namely recognizing the gender of the user from the voice segment output by the preprocessing step and outputting gender information; and
and a fusion processing step of fusing the gender information and the semantic information to obtain personalized reply information for the voice information.
Optionally, the gender classification step includes:
a model training sub-step of training a long-short-time memory model based on the output acoustic characteristics of the filter and the pre-labeled gender information to obtain the long-short-time memory model; and
and a gender classification sub-step, namely inputting the voice segment into a long-short-time memory model obtained through training and outputting gender classification.
Optionally, in the preprocessing step, an endpoint detection algorithm is used to detect a speech segment for the input speech information.
Optionally, in the preprocessing step, for the input voice information, an endpoint detection algorithm is used to detect a voice segment and output a first voice segment provided for the semantic recognition step and a second voice segment provided for the gender classification step, wherein an endpoint detection boundary of the second voice segment is more strict than an endpoint detection boundary of the first voice segment.
Optionally, the model training substep comprises:
preparing a training set with gender labeling;
extracting output acoustic features of a filter of the training set;
constructing a labeling file corresponding to the output acoustic characteristics of the filter; and
inputting the output acoustic characteristics of the filter and the annotation file into the long-short-time memory model for model training until the model converges.
Optionally, the sex classification substep includes:
inputting the voice segment into a long-short-time memory model obtained through training;
forward calculation is carried out to obtain posterior probabilities of different classification sexes; and
the posterior probability for a predetermined period of time is accumulated to obtain a gender classification result.
The voice interaction system of the present application is characterized by comprising:
the preprocessing module is used for preprocessing the input voice information and outputting voice segments;
the semantic recognition module is used for carrying out semantic recognition on the voice segment output by the preprocessing module and outputting semantic information;
the gender classification module is used for performing gender classification on the voice segment output by the preprocessing module, identifying the gender of the user and outputting gender information; and
and the fusion processing module is used for fusing the gender information and the semantic information to obtain personalized reply information for the voice information.
Optionally, the gender classification module includes:
the model training sub-module is used for training the long-short-time memory model based on the output acoustic characteristics of the filter and the pre-labeled gender information to obtain the long-short-time memory model; and
and the gender classification sub-module is used for inputting the voice segment into the long-short-time memory model obtained through training and outputting gender classification.
Optionally, in the preprocessing module, for the input voice information, an endpoint detection algorithm is used to detect a voice segment.
Optionally, the preprocessing module performs voice segment detection on the input voice information using an endpoint detection algorithm and outputs a first voice segment provided to the semantic recognition module and a second voice segment provided to the gender classification module,
wherein the end-point detection boundary of the second speech segment is more stringent than the end-point detection boundary of the first speech segment.
Optionally, the model training sub-module extracts the output acoustic characteristics of the filter of the training set based on the training set with gender labeling, constructs a labeling file corresponding to the output acoustic characteristics of the filter, and inputs the output acoustic characteristics of the filter and the labeling file into the long-short-time memory model for model training until the model converges.
Optionally, the gender classification sub-module inputs the voice segment into a long-short-time memory model obtained through training, obtains posterior probabilities of different classification sexes through forward calculation, and accumulates the posterior probabilities of a specified time to obtain gender classification results.
The voice interaction method of the application is applied to a vehicle, or the voice interaction system of the application is applied to a vehicle.
The application also provides voice interaction equipment which can execute the voice interaction method or comprises the voice interaction system.
Optionally, the voice interaction device is disposed on a vehicle.
The application provides a controller comprising a storage component, a processing component, and instructions stored on the storage component and executable by the processing component, wherein the processing component implements the above voice interaction method when the instructions are executed. According to the voice interaction method and the voice interaction system, by combining semantic analysis with gender classification, replies can be differentiated according to the gender of the user, which improves the user experience and makes voice interaction more intelligent.
Other features and advantages of the methods and apparatus of the present application will become apparent from, or are elucidated by, the accompanying drawings and the following detailed description, which together serve to illustrate certain principles of the application.
Drawings
Fig. 1 is a flowchart showing a voice interaction method according to an embodiment of the present application.
Fig. 2 is a schematic illustration of a specific flow of the gender classification step.
Fig. 3 is a block diagram showing the construction of a voice interaction system according to an embodiment of the present application.
Detailed Description
The following presents a simplified summary of the application in order to provide a basic understanding of the application. It is not intended to identify key or critical elements of the application or to delineate the scope of the application.
First, some terms that will appear hereinafter will be explained.
NLU: natural language understanding;
ASR: automatic speech recognition;
long-short-time memory model (LSTM, long short-term memory): a deep learning model that can learn long-term dependency information;
feats (features): filter bank feature parameters of the audio file;
cmvn: cepstral mean and variance normalization statistics of the feature files;
GMM-HMM: a conventional acoustic model, namely a hidden Markov model based on a Gaussian mixture model.
Fig. 1 is a flowchart of a voice interaction method according to an embodiment of the present application.
Referring to fig. 1, the voice interaction method according to an embodiment of the present application includes the following steps:
input step S100: inputting voice information;
preprocessing step S200: preprocessing the voice information input in the input step S100 and outputting voice segments;
semantic recognition step S300: carrying out semantic recognition on the voice segment output by the preprocessing step S200 and outputting semantic information;
gender classification step S400: performing gender classification on the voice segment output by the preprocessing step S200, identifying the gender of the user and outputting gender information;
fusion processing step S500: fusing the gender information and the semantic information to obtain personalized reply information for the input voice information; and
output step S600: outputting the personalized reply information. For example, the output may be by voice or as text.
Next, an exemplary explanation is given of the preprocessing step S200, the gender classification step S400, and the fusion processing step S500. In the semantic recognition step S300, the semantic recognition of the speech segment and the output of the semantic information may be performed by the same technical means as in the conventional technique, and the description thereof is omitted.
As an example, in the preprocessing step S200, an endpoint detection algorithm (VAD) is used on the input speech information to obtain speech segments. For example, the user's voice information is input into a VAD model, which obtains the speech segments by means of endpoint detection, feature extraction, and the like. The obtained speech segments are provided to the subsequent semantic recognition step S300 and gender classification step S400, respectively. The voice recognition task requires that complete text information be preserved as far as possible, so the VAD boundary should be more tolerant; the gender classification task requires that all silence be eliminated as much as possible, so the VAD boundary should be stricter. Thus, two different speech segments may optionally be provided separately to the subsequent semantic recognition step S300 and gender classification step S400 in the preprocessing step S200.
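As a non-limiting illustration of this two-boundary arrangement (not taken from the original disclosure; the frame sizes, energy thresholds and hangover length are assumptions), a minimal energy-based endpoint detection pass might derive both segments as follows:

```python
# Illustrative energy-based endpoint detection (VAD) producing two speech
# segments with different boundary strictness: a tolerant segment for
# semantic recognition and a strict, silence-free segment for gender
# classification. All thresholds and frame sizes are assumptions.
import numpy as np

FRAME_LEN = 400      # 25 ms at 16 kHz (assumed)
FRAME_SHIFT = 160    # 10 ms at 16 kHz (assumed)

def frame_energies(signal: np.ndarray) -> np.ndarray:
    """Split the signal into frames and return per-frame log energy."""
    n_frames = 1 + max(0, (len(signal) - FRAME_LEN) // FRAME_SHIFT)
    frames = np.stack([signal[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN]
                       for i in range(n_frames)])
    return np.log(np.sum(frames.astype(np.float64) ** 2, axis=1) + 1e-10)

def detect_segments(signal: np.ndarray):
    """Return (tolerant_mask, strict_mask) over frames.

    The tolerant mask pads the voiced region with a hangover so that no
    text-bearing frames are lost for semantic recognition; the strict mask
    keeps only clearly voiced frames so that silence is rejected for
    gender classification.
    """
    energy = frame_energies(signal)
    floor = np.percentile(energy, 10)        # rough noise floor
    strict_mask = energy > floor + 6.0       # stricter boundary
    tolerant_mask = energy > floor + 2.0     # looser boundary
    # Hangover: extend the tolerant region by 10 frames on each side.
    idx = np.flatnonzero(tolerant_mask)
    if idx.size:
        lo, hi = max(0, idx[0] - 10), min(len(energy), idx[-1] + 10)
        tolerant_mask[lo:hi] = True
    return tolerant_mask, strict_mask
```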
Next, the gender classification step S400 will be described.
Fig. 2 is a specific flowchart of the gender classification step S400.
As shown in fig. 2, the gender classification step S400 may be roughly divided into a training phase and an identification phase.
First, a training phase will be described.
A training set with gender labeling needs to be prepared as training samples, including wav.scp, utt2spk, text, and the gender information corresponding to each utterance. The features of the training set (i.e., the filter bank feature parameters of the audio files, the feats in fig. 2) and the cmvn statistics are then extracted in preparation for training the long-short-time memory model.
Because the gender model is a classification model, annotation files (the FA in fig. 2) corresponding to the features need to be constructed. The annotation files FA target only the speech segments of the features, and a batch of annotation files FA reflecting the gender of the feature files is constructed according to the number of feature frames.
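The following is a minimal sketch of this preparation stage, assuming librosa as a stand-in feature extractor, 40 filter bank channels, and a 0/1 gender label encoding; the patent does not name a specific toolkit or file layout, so these details are illustrative only:

```python
# Illustrative preparation of filter bank (fbank) features, cmvn statistics
# and frame-level gender label files for the training stage. librosa is used
# only as a convenient stand-in feature extractor; label encoding
# (0 = female, 1 = male) is an assumption for illustration.
import numpy as np
import librosa

N_MELS = 40  # assumed number of filter bank channels

def extract_fbank(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return log mel filter bank features, shape (n_frames, N_MELS)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=N_MELS)
    return np.log(mel.T + 1e-10)

def cmvn_stats(feats: np.ndarray):
    """Per-dimension mean/variance statistics used for normalization."""
    return feats.mean(axis=0), feats.std(axis=0) + 1e-8

def apply_cmvn(feats, mean, std):
    return (feats - mean) / std

def frame_labels(feats: np.ndarray, gender: str) -> np.ndarray:
    """Build the frame-level label ('FA') vector: one gender label per
    feature frame, following the number of feature frames."""
    label = 0 if gender == "female" else 1
    return np.full(len(feats), label, dtype=np.int64)
```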
The prepared feature files feats and the annotation files FA are input into the long-short-time memory model for training until convergence. Here, LSTM (Long Short-Term Memory) is a kind of recurrent neural network (RNN). An RNN is a special neural network that calls itself along a time sequence or character sequence; when unrolled along the sequence it becomes an ordinary three-layer neural network, and it is often used for speech recognition.
Here, the basic parameters adopted by the long-short-time memory model are:
num-lstm-layers: 1;
cell-dim: 1024;
lstm-delay: -1.
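A minimal PyTorch sketch of a classifier matching these hyperparameters (one LSTM layer, cell dimension 1024) is shown below; the input feature dimension, the frame-level cross-entropy training step, and the reading of "lstm-delay: -1" as a plain left-to-right recurrence are assumptions added for illustration:

```python
# Minimal sketch of a gender classifier with the listed hyperparameters:
# a single LSTM layer with a cell dimension of 1024. The input dimension
# (40 fbank channels) and training loop are illustrative assumptions.
import torch
import torch.nn as nn

class GenderLSTM(nn.Module):
    def __init__(self, feat_dim: int = 40, cell_dim: int = 1024,
                 num_layers: int = 1, num_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, cell_dim, num_layers=num_layers,
                            batch_first=True)
        self.out = nn.Linear(cell_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_frames, feat_dim) -> per-frame class logits
        hidden, _ = self.lstm(feats)
        return self.out(hidden)

# One illustrative training step on a single utterance.
model = GenderLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

feats = torch.randn(1, 200, 40)                # stand-in fbank features
labels = torch.ones(1, 200, dtype=torch.long)  # frame-level gender labels

optimizer.zero_grad()
logits = model(feats)                          # (1, 200, 2)
loss = criterion(logits.view(-1, 2), labels.view(-1))
loss.backward()
optimizer.step()
```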
next, the identification phase will be described.
First, feature extraction is required. When the user speaks, the speech information is first detected using the endpoint detection algorithm (VAD), and feature extraction is performed on the non-silence speech frames detected by the VAD. Since the long-short-time memory model depends on past time steps, a buffer may be provided for feature accumulation.
Then, forward computation is performed. A feature matrix of a certain length is fed into the long-short-time memory model, and the posterior probabilities of the different gender classes are obtained through forward calculation. A posterior probability is a probability that is revised after information about the "result" is obtained; it is the "cause" in the problem of inferring the cause from the observed effect. The probability that an event will occur, estimated before it happens, is the prior probability; once the event has happened, the probability that it was caused by a particular factor is the posterior probability.
Finally, posterior processing is performed. A time threshold T is set through repeated experiments, the posterior probability values accumulated over the duration T are compared, and the class with the larger probability value is taken as the gender classification result of the input audio. Here, the time threshold T may be, for example, 0.5 s or 1 s. The time threshold T cannot be set too long, because more data would then be needed and the real-time performance of the recognition would suffer; nor can it be set too short, because the accuracy might not be high enough.
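A minimal sketch of this posterior processing is shown below, assuming a 10 ms frame shift and a softmax over the model's frame-level outputs; both are illustrative assumptions rather than values fixed by the disclosure:

```python
# Illustrative posterior processing: per-frame posteriors are accumulated
# over the time threshold T and the class with the larger accumulated
# probability is taken as the gender classification result.
import torch

FRAME_SHIFT_S = 0.01   # assumed 10 ms per frame
T = 0.5                # time threshold in seconds (0.5 s or 1 s per the text)

def classify_gender(model: torch.nn.Module, feats: torch.Tensor) -> str:
    """feats: (1, n_frames, feat_dim); returns 'female' or 'male'."""
    n_frames_needed = int(T / FRAME_SHIFT_S)
    with torch.no_grad():
        logits = model(feats)                        # (1, n_frames, 2)
        post = torch.softmax(logits, dim=-1)[0]      # (n_frames, 2)
    accumulated = post[:n_frames_needed].sum(dim=0)  # accumulate over T
    return "female" if accumulated[0] > accumulated[1] else "male"

# Usage with the GenderLSTM sketch above and stand-in features:
# print(classify_gender(model, torch.randn(1, 100, 40)))
```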
In this way, on the one hand, the voice segment is semantically recognized and semantic information is output in the semantic recognition step S300; on the other hand, the voice segment is gender-classified, the gender of the user is recognized and gender information is output in the gender classification step S400; the recognized gender information and semantic information are then fused in the fusion processing step S500, so that personalized reply information for the input voice information is obtained. In some examples of the present application, the "fusion" mentioned in step S500 may be understood as taking the gender information obtained in step S400 into account when generating the voice interaction reply, for example to make the reply more targeted or more appropriate, as in the examples given below. Other uses of the gender information obtained in step S400 are, however, not excluded.
For example, when the voice input by the user is "Good morning!", if the gender classification step S400 recognizes a male, "Good morning, sir!" is output, and if it recognizes a female, "Good morning, madam!" is output. When the voice input by the user is "Do you think I look nice?", if the gender classification step S400 recognizes a male, "Of course, you are a handsome guy!" is output, and if it recognizes a female, "Of course, you are a great beauty!" is output. When the voice input by the user is "What time is it now?", if the gender classification step S400 recognizes a male, "Sir, it is now 3 p.m." is output, and if it recognizes a female, "Madam, it is now 3 p.m." is output.
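A minimal sketch of such a fusion rule is shown below; the intent names and reply templates are illustrative assumptions that mirror the examples above, not a prescribed mapping:

```python
# Illustrative fusion step: the recognized intent (semantic information) is
# combined with the gender classification to pick a personalized reply.
# Intent names and templates are assumptions mirroring the examples above.
REPLY_TEMPLATES = {
    ("greeting", "male"): "Good morning, sir!",
    ("greeting", "female"): "Good morning, madam!",
    ("ask_time", "male"): "Sir, it is now 3 p.m.",
    ("ask_time", "female"): "Madam, it is now 3 p.m.",
}

def fuse(intent: str, gender: str, default: str = "OK.") -> str:
    """Combine semantic information (intent) with gender information."""
    return REPLY_TEMPLATES.get((intent, gender), default)

print(fuse("greeting", "female"))   # -> "Good morning, madam!"
```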
The embodiments of the voice interaction method of the present application are described above. Next, the voice interaction system of the present application will be described.
fig. 3 is a block diagram showing the construction of a voice interaction system according to an embodiment of the present application.
As shown in fig. 3, a voice interaction system according to an embodiment of the present application includes:
an input module 100 for inputting voice information;
the preprocessing module 200 is used for receiving and preprocessing voice information and outputting voice segments;
the gender classification module 300 is used for performing gender classification on the voice segment output by the preprocessing module, identifying the gender of the user and outputting gender information;
the semantic recognition module 400 is used for carrying out semantic recognition on the voice segment output by the preprocessing module and outputting semantic information;
the fusion processing module 500 is configured to fuse the gender information and the semantic information to obtain personalized reply information for the voice information; and
and the output module 600 is used for outputting the personalized reply information in a voice way.
The preprocessing module 200 performs speech segment detection on the input voice information using an endpoint detection algorithm (VAD). Specifically, the preprocessing module 200 detects speech segments in the input voice information and outputs a first voice segment provided to the gender classification module 300 and a second voice segment provided to the semantic recognition module 400. Because the gender classification module requires that all silence segments be removed as far as possible, the VAD boundary of the first voice segment should be stricter; because the semantic recognition module 400 requires that complete text information be preserved as far as possible, the VAD boundary of the second voice segment should be more tolerant. Therefore, the endpoint detection boundary of the first voice segment is stricter than the endpoint detection boundary of the second voice segment.
Wherein the gender classification module 300 comprises:
the model training sub-module 310 is configured to perform long-short-time memory model training based on the output acoustic features of the filter and pre-labeled gender information to obtain a long-short-time memory model; and
the gender classification sub-module 320 is configured to input the speech segment into a long-short-term memory model obtained through training and output gender classification.
The model training sub-module 310 extracts the output acoustic features of the filter of the training set based on the training set with gender labeling, constructs the annotation files FA corresponding to the output acoustic features of the filter, and inputs the output acoustic features of the filter and the annotation files into the long-short-time memory model for model training until the model converges. The gender classification sub-module 320 inputs the voice segments into the long-short-time memory model obtained through training, obtains the posterior probabilities of the different gender classes through forward calculation, and accumulates the posterior probabilities over a prescribed time period to obtain the gender classification result.
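To illustrate how the modules of fig. 3 fit together, the following sketch wires stand-in callables for the preprocessing, gender classification, semantic recognition, fusion processing, and output modules; the interfaces are assumptions, since the disclosure does not prescribe concrete signatures:

```python
# Illustrative wiring of the modules of fig. 3: the preprocessing module
# produces the two speech segments, the gender classification and semantic
# recognition modules consume them, and the fusion module produces the
# personalized reply. The callables are stand-ins, not a prescribed API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceInteractionSystem:
    preprocess: Callable       # audio -> (segment_for_gender, segment_for_asr)
    classify_gender: Callable  # segment -> "male" | "female"
    recognize: Callable        # segment -> intent string (semantic information)
    fuse: Callable             # (intent, gender) -> reply string
    output: Callable           # reply string -> None (e.g. TTS or text)

    def handle(self, audio) -> None:
        seg_gender, seg_asr = self.preprocess(audio)
        gender = self.classify_gender(seg_gender)
        intent = self.recognize(seg_asr)
        self.output(self.fuse(intent, gender))
```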
The voice interaction method described in any of the above examples can be applied to a vehicle, or the voice interaction system described in any of the above examples can be applied to a vehicle. For example, as part of a vehicle control method or vehicle control system.
The present application also provides a voice interaction device capable of performing the voice interaction method as described in any of the examples above; alternatively, it comprises a voice interaction system as described in any of the examples above. The voice interaction device can be implemented separately as a component which can be provided in a vehicle, for example, so that a person in the vehicle can interact with it in voice. The voice interaction device may be a device fixed to the vehicle or a device capable of being taken from/put back into the vehicle. And further, in some examples, the voice interaction device is capable of communicating with an electronic control system within the vehicle. In some cases, the voice interaction device may also be implemented in an existing electronic component of the vehicle, such as an infotainment system of the vehicle, etc.
The application also provides a controller comprising a storage component, a processing component, and instructions stored on the storage component and executable by the processing component, wherein the processing component implements the above voice interaction method when the instructions are executed.
According to the voice interaction method and the voice interaction system of the above examples, by combining semantic analysis with gender classification, replies can be differentiated according to the gender of the user, which improves the user experience and makes voice interaction more intelligent.
The above examples mainly illustrate the voice interaction method and the voice interaction system of the present application. Although only a few specific embodiments of the present application have been described, those skilled in the art will appreciate that the present application may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and the application is intended to cover various modifications and substitutions without departing from the spirit and scope of the application as defined by the appended claims.

Claims (12)

1. A method of voice interaction, comprising:
a preprocessing step, namely preprocessing the input voice information and outputting a first voice segment and a second voice segment;
a semantic recognition step, namely performing semantic recognition on the first voice segment output by the preprocessing step and outputting semantic information;
a gender classification step, namely identifying the gender of the user from the second voice segment output by the preprocessing step and outputting gender information; and
a fusion processing step of fusing the gender information and the semantic information to obtain personalized reply information for the voice information,
wherein in the preprocessing step, for the input voice information, detection of a voice segment is performed using an end point detection algorithm and the first voice segment supplied to the semantic recognition step and the second voice segment supplied to the gender classification step are output,
wherein the first speech segment is different from the second speech segment, wherein the first speech segment is provided such that complete text information is preserved and the second speech segment is provided such that all silence is rejected, and wherein the end point detection boundary of the second speech segment is more stringent than the end point detection boundary of the first speech segment.
2. The voice interaction method of claim 1, wherein the gender classification step comprises:
a model training sub-step of training a long-short-time memory model based on the output acoustic characteristics of the filter and the pre-labeled gender information to obtain the long-short-time memory model; and
and a gender classification sub-step, namely inputting the voice segment into a long-short-time memory model obtained through training and outputting gender classification.
3. The voice interaction method of claim 2, wherein the model training substep comprises:
preparing a training set with gender labeling;
extracting output acoustic features of a filter of the training set;
constructing a labeling file corresponding to the output acoustic characteristics of the filter; and
inputting the output acoustic characteristics of the filter and the annotation file into the long-short-time memory model for model training until the model converges.
4. The voice interaction method of claim 2, wherein the gender classification sub-step comprises:
inputting the voice segment into a long-short-time memory model obtained through training;
forward calculation is carried out to obtain posterior probabilities of different classification sexes; and
the posterior probability for a predetermined period of time is accumulated to obtain a gender classification result.
5. A voice interactive system, comprising:
the preprocessing module is used for preprocessing the input voice information and outputting a first voice segment and a second voice segment;
the semantic recognition module is used for carrying out semantic recognition on the first voice segment output by the preprocessing module and outputting semantic information;
the gender classification module is used for carrying out gender classification on the second voice segment output by the preprocessing module, identifying the gender of the user and outputting gender information; and
a fusion processing module for fusing the gender information and the semantic information to obtain personalized reply information for the voice information,
wherein, in the preprocessing module, for the input voice information, a voice segment is detected using an end point detection algorithm and the first voice segment provided to the semantic recognition module and the second voice segment provided to the gender classification module are output,
wherein the first speech segment is different from the second speech segment, wherein the first speech segment is provided such that complete text information is preserved and the second speech segment is provided such that all silence is rejected, and wherein the end point detection boundary of the second speech segment is more stringent than the end point detection boundary of the first speech segment.
6. The voice interactive system of claim 5, wherein the gender classification module comprises:
the model training sub-module is used for training the long-short-time memory model based on the output acoustic characteristics of the filter and the pre-labeled gender information to obtain the long-short-time memory model; and
and the gender classification sub-module is used for inputting the voice segment into the long-short-time memory model obtained through training and outputting gender classification.
7. The voice interactive system of claim 6, wherein,
and the model training submodule extracts the output acoustic characteristics of the filter of the training set based on the training set with gender marking, constructs a marking file corresponding to the output acoustic characteristics of the filter, and inputs the output acoustic characteristics of the filter and the marking file into a long-time and short-time memory model for model training until the model converges.
8. The voice interactive system of claim 5, wherein the gender classification sub-module inputs the voice segments into a long-short-term memory model obtained through training, obtains posterior probabilities of different classification sexes through forward calculation and accumulates posterior probabilities for a prescribed time period to obtain gender classification results.
9. A voice interaction method as claimed in any one of claims 1 to 4 or a voice interaction system as claimed in any one of claims 5 to 8 for use in a vehicle.
10. A voice interaction device capable of performing the voice interaction method of any one of claims 1 to 4 or comprising the voice interaction system of any one of claims 5 to 8.
11. The voice interaction device of claim 10, disposed on a vehicle.
12. A controller comprising a storage means, a processing means and instructions stored on the storage means and executable by the processing means, wherein the processing means implements the voice interaction method of any one of claims 1 to 4 when the instructions are executed.
CN201810473045.9A 2018-05-17 2018-05-17 Voice interaction method and voice interaction system Active CN110503943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810473045.9A CN110503943B (en) 2018-05-17 2018-05-17 Voice interaction method and voice interaction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810473045.9A CN110503943B (en) 2018-05-17 2018-05-17 Voice interaction method and voice interaction system

Publications (2)

Publication Number Publication Date
CN110503943A CN110503943A (en) 2019-11-26
CN110503943B true CN110503943B (en) 2023-09-19

Family

ID=68583957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810473045.9A Active CN110503943B (en) 2018-05-17 2018-05-17 Voice interaction method and voice interaction system

Country Status (1)

Country Link
CN (1) CN110503943B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883133B (en) * 2020-07-20 2023-08-29 深圳乐信软件技术有限公司 Customer service voice recognition method, customer service voice recognition device, server and storage medium
CN112397067A (en) * 2020-11-13 2021-02-23 重庆长安工业(集团)有限责任公司 Voice control terminal of weapon equipment
CN113870861A (en) * 2021-09-10 2021-12-31 Oppo广东移动通信有限公司 Voice interaction method and device, storage medium and terminal
CN116092056B (en) * 2023-03-06 2023-07-07 安徽蔚来智驾科技有限公司 Target recognition method, vehicle control method, device, medium and vehicle

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105700682A (en) * 2016-01-08 2016-06-22 北京乐驾科技有限公司 Intelligent gender and emotion recognition detection system and method based on vision and voice
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103871401B (en) * 2012-12-10 2016-12-28 联想(北京)有限公司 A kind of method of speech recognition and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105700682A (en) * 2016-01-08 2016-06-22 北京乐驾科技有限公司 Intelligent gender and emotion recognition detection system and method based on vision and voice
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning

Also Published As

Publication number Publication date
CN110503943A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
CN110503943B (en) Voice interaction method and voice interaction system
KR101702829B1 (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN110415705B (en) Hot word recognition method, system, device and storage medium
EP3260996A1 (en) Dialogue act estimation method, dialogue act estimation apparatus, and storage medium
KR102413692B1 (en) Apparatus and method for caculating acoustic score for speech recognition, speech recognition apparatus and method, and electronic device
CN111797632B (en) Information processing method and device and electronic equipment
CN112233680B (en) Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium
CN112735383A (en) Voice signal processing method, device, equipment and storage medium
CN111191450B (en) Corpus cleaning method, corpus input device and computer readable storage medium
KR20140042994A (en) Machine learning based of artificial intelligence conversation system using personal profiling information to be extracted automatically from the conversation contents with the virtual agent
CN107564528B (en) Method and equipment for matching voice recognition text with command word text
CN113506574A (en) Method and device for recognizing user-defined command words and computer equipment
CN112579762B (en) Dialogue emotion analysis method based on semantics, emotion inertia and emotion commonality
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN111199149A (en) Intelligent statement clarifying method and system for dialog system
KR101590908B1 (en) Method of learning chatting data and system thereof
CN106708950B (en) Data processing method and device for intelligent robot self-learning system
KR102429656B1 (en) A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor
CN109065026B (en) Recording control method and device
KR101444411B1 (en) Apparatus and method for automated processing the large speech data based on utterance verification
CN106682642A (en) Multi-language-oriented behavior identification method and multi-language-oriented behavior identification system
CN112466286A (en) Data processing method and device and terminal equipment
KR102370437B1 (en) Virtual Counseling System and counseling method using the same
CN115512687A (en) Voice sentence-breaking method and device, storage medium and electronic equipment
CN111883109B (en) Voice information processing and verification model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200806

Address after: Susong Road West and Shenzhen Road North, Hefei Economic and Technological Development Zone, Anhui Province

Applicant after: Weilai (Anhui) Holding Co.,Ltd.

Address before: 30 Floor of Yihe Building, No. 1 Kangle Plaza, Central, Hong Kong, China

Applicant before: NIO NEXTEV Ltd.

GR01 Patent grant