CN113362815A - Voice interaction method, system, electronic equipment and storage medium

Voice interaction method, system, electronic equipment and storage medium

Info

Publication number
CN113362815A
CN113362815A
Authority
CN
China
Prior art keywords
text
meaningless
model
text information
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110707954.6A
Other languages
Chinese (zh)
Inventor
李翠姣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudminds Robotics Co Ltd
Original Assignee
Cloudminds Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudminds Robotics Co Ltd filed Critical Cloudminds Robotics Co Ltd
Priority to CN202110707954.6A priority Critical patent/CN113362815A/en
Publication of CN113362815A publication Critical patent/CN113362815A/en
Priority to PCT/CN2021/140759 priority patent/WO2022267405A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention relate to the technical field of voice interaction and disclose a voice interaction method, system, electronic device and storage medium. The voice interaction method comprises the following steps: acquiring text information obtained after a voice signal is subjected to automatic speech recognition (ASR) processing, wherein the voice signal is a sound signal acquired from the environment; performing feature extraction on the text information to obtain a feature vector of the text information; inputting the feature vector into a trained meaningless text recognition model, and judging whether the text information is meaningless text according to the output of the model, wherein meaningless text is text that does not conform to a conventional expression mode; and if the text information is not meaningless text, responding to the text information after a trained response judgment model detects that a response is needed. Random answering and incessant answering by the electronic device can thus be avoided without having to improve ASR processing precision, improving the response effect in noisy environments.

Description

Voice interaction method, system, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of voice interaction, in particular to a voice interaction method, a voice interaction system, electronic equipment and a storage medium.
Background
Generally, after an electronic device such as a robot acquires a speech signal from the environment, the speech signal is converted into text information through Automatic Speech Recognition (ASR) processing; Natural Language Understanding (NLU) processing is then performed on the text information to extract the intention it contains and determine the corresponding response content; the response content is converted into speech through Text-To-Speech (TTS) processing; and finally the speech is output to complete the voice interaction. Because both the NLU processing and the TTS processing operate on the text information produced by ASR processing, the quality of the ASR result directly affects the response effect of the voice interaction. In practical application scenarios the environment in which voice interaction takes place is usually noisy and interference is unavoidable, particularly in public settings such as airports and hospitals, where the ambient sound is louder and the interference greater. In a noisy environment the acquired speech signal contains a great deal of background sound, such as the chat of surrounding people and environmental noise; when the speech signal is converted into text information by ASR processing, the background noise is converted along with it, so the ASR result is poor and the electronic device answers randomly and keeps answering without stopping. One possible solution is to continually improve the accuracy and precision of ASR processing to reduce the noise input.
However, judging from the results of high-precision, high-accuracy ASR models in the prior art, the problem of the electronic device answering randomly and incessantly remains unsolved, and it is difficult to keep improving the precision and accuracy of ASR processing. A new voice interaction method is therefore urgently needed that avoids random and incessant answering by the electronic device and improves the response effect in noisy environments.
Disclosure of Invention
Embodiments of the present invention provide a voice interaction method, system, electronic device, and storage medium, so that random answering and incessant answering by the electronic device can be avoided without improving ASR processing precision and accuracy, improving the response effect in a noisy environment.
In order to achieve the above object, an embodiment of the present invention provides a voice interaction method including the following steps: acquiring text information obtained after a voice signal is subjected to automatic speech recognition (ASR) processing, wherein the voice signal is a sound signal acquired from the environment; performing feature extraction on the text information to obtain a feature vector of the text information; inputting the feature vector into a trained meaningless text recognition model, and judging whether the text information is meaningless text according to the output of the model, wherein meaningless text is text that does not conform to a conventional expression mode; and if the text information is not meaningless text, responding to the text information after the trained response judgment model detects that a response is needed.
An embodiment of the invention also provides a voice interaction system, comprising: an acquisition module, configured to acquire text information obtained by processing a voice signal through automatic speech recognition (ASR), wherein the voice signal is a sound signal acquired from the environment; a feature extraction module, configured to perform feature extraction on the text information to obtain a feature vector of the text information; a meaning judgment module, configured to input the feature vector into a trained meaningless text recognition model and judge whether the text information is meaningless text according to the output of the model, wherein meaningless text is text that does not conform to a conventional expression mode; and a response module, configured to respond to the text information, if it is not meaningless text, after the trained response judgment model detects that a response is needed.
To achieve the above object, an embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the voice interaction method described above.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the voice interaction method described above.
According to the voice interaction method provided by the embodiment of the invention, after the text information obtained by ASR processing of the voice signal in the environment is acquired, natural language understanding (NLU) is not applied directly for intention understanding and response. Instead, feature extraction is first performed on the text information, so that the resulting feature vector can be fed into a trained meaningless text recognition model, and whether the text information is meaningless text is judged from the model's output. The text information obtained by ASR processing is thus checked before any response: a response is made only if the text is determined to be meaningful and in need of a response. As a result, what gets responded to is always meaningful text that requires a response; response errors caused by noise-derived text embedded in the text information are eliminated, and incessant responses triggered by treating noise and other ambient sound as voice input are avoided. In other words, the device neither answers wrongly nor answers noise, which improves the response effect of voice interaction in noisy environments.
In addition, in the voice interaction method provided by the embodiment of the present invention, before the feature vector is input into the trained meaningless text recognition model, the method further includes: constructing an initial training set containing both meaningless text and meaningful text; performing feature extraction on the meaningless and meaningful texts in the initial training set and taking the resulting feature vectors as a recognition training set; and training the meaningless text recognition model on the recognition training set to obtain the trained model. Because training uses the extracted features of the texts rather than the raw texts themselves, the training samples reflect the meaning-related characteristics of the text without over-weighting its other characteristics, so the model recognizes meaningfulness more accurately, improving the recognition effect of the meaningless text recognition model.
In addition, in the voice interaction method provided by the embodiment of the present invention, performing feature extraction includes: extracting features along multiple dimensions using the LSTM model, the Unigram model and the BERT model within the natural language understanding NLU model. Before the feature extraction, the method further includes: constructing a BERT training set containing both meaningless text and meaningful text; training the BERT model on the BERT training set; and training the Unigram model and the LSTM model on an open-source dataset. Feature extraction then yields feature vectors of multiple dimensions rather than a single perplexity or classification result, so more information is available when judging whether the text information is meaningful, improving the accuracy and recall of the model. Training the BERT model on a set composed of meaningful and meaningless texts enables the BERT model to perceive whether text is meaningful, sharpening its output along the meaningful/meaningless dimension.
In addition, in the voice interaction method provided by the embodiment of the present invention, the meaningless texts in the initial training set and the BERT training set are obtained as follows: acquiring, from a corpus, texts that do not conform to the conventional expression mode and texts that do; randomly applying an adjusting operation to the texts that conform to the conventional expression mode, the adjusting operation comprising one or a combination of: shuffling, cutting and splicing; and taking the adjusted conforming texts together with the non-conforming texts as the meaningless texts of the initial training set and the BERT training set. New meaningless texts are thus generated directly by cutting, shuffling and splicing meaningful texts, without manual judgment and labeling, enlarging the dataset without additional human effort.
In addition, in the voice interaction method provided by the embodiment of the present invention, the intersection of the BERT training set and the initial training set is the empty set. Because the BERT model's output resembles its input, and that output is among the features fed to the meaningless text recognition model, training both models on the same dataset would give the BERT training data and the recognition training data overlapping content, leading to overfitting of the trained meaningless text recognition model. Keeping the two sets disjoint avoids this.
In addition, in the voice interaction method provided by the embodiment of the invention, the meaningless text recognition model is an extreme gradient boosting (XGBoost) model. The XGBoost model can process multiple features simultaneously, i.e. several features of one text can be used when recognizing meaningless text, and using multiple features of the text greatly improves recognition accuracy and model recall.
In addition, in the voice interaction method provided by the embodiment of the present invention, the response judgment model is a FastText model, and before the trained response judgment model detects that the text information needs a response, the method further includes: training the FastText model on a pre-constructed response dataset to obtain the trained FastText model, wherein the response dataset contains texts that need a response and texts that do not.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not to scale unless otherwise specified.
FIG. 1 is a flow chart of a method of voice interaction in an embodiment of the present invention;
FIG. 2 is a flow chart of a method of voice interaction including a forgoing response step in another embodiment of the present invention;
FIG. 3 is a flow chart of a method of voice interaction including the step of constructing an initial training set in another embodiment of the present invention;
FIG. 4 is a flow chart of a method of voice interaction including the step of constructing a BERT training set in another embodiment of the present invention;
FIG. 5 is a flowchart of the step of constructing a BERT data set and obtaining meaningless text in an initial data set in a voice interaction method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a voice interaction system in another embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device in another embodiment of the invention.
Detailed Description
As noted in the background, in the related art voice interaction is carried out through processing such as Automatic Speech Recognition (ASR), Natural Language Understanding (NLU) and Text-To-Speech (TTS), and the accuracy of the ASR module's recognition result directly affects the response effect during voice interaction. The usual way to improve the response effect is to improve ASR accuracy; however, even though existing ASR models are already quite accurate, the phenomena of the electronic device answering randomly and incessantly persist, and it is difficult to keep raising the accuracy of ASR processing. A new voice interaction method is therefore needed to avoid random and incessant answering by the electronic device and to improve the response effect in noisy environments.
In order to avoid random and incessant answering by electronic devices and to improve the response effect in noisy environments, an embodiment of the invention provides a voice interaction method including the following steps: acquiring text information obtained after a voice signal is subjected to automatic speech recognition (ASR) processing, wherein the voice signal is a sound signal acquired from the environment; performing feature extraction on the text information to obtain a feature vector of the text information; inputting the feature vector into a trained meaningless text recognition model, and judging whether the text information is meaningless text according to the output of the model, wherein meaningless text is text that does not conform to a conventional expression mode; and if the text information is not meaningless text, responding to the text information after the trained response judgment model detects that a response is needed.
According to the voice interaction method provided by the embodiment of the invention, after the text information obtained by ASR processing of the voice signal in the environment is acquired, natural language understanding (NLU) is not applied directly for intention understanding and response. Instead, feature extraction is first performed on the text information, so that the resulting feature vector can be fed into a trained meaningless text recognition model, and whether the text information is meaningless text is judged from the model's output. The text information obtained by ASR processing is thus checked before any response: a response is made only if the text is determined to be meaningful and in need of a response. As a result, what gets responded to is always meaningful text that requires a response; response errors caused by noise-derived text embedded in the text information are eliminated, and incessant responses triggered by treating noise and other ambient sound as voice input are avoided. In other words, the device neither answers wrongly nor answers noise, which improves the response effect of voice interaction in noisy environments.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to aid understanding of the present application; however, the claimed technical solution can be implemented without these technical details, and various changes and modifications may be made based on the following embodiments.
The implementation details of the voice interaction method of the present embodiment are described below with reference to fig. 1 to 5; the following details are provided only for ease of understanding and are not required to implement the present embodiment.
Referring to fig. 1, in some embodiments, the voice interaction method is applied to an electronic device capable of voice interaction, such as a robot, a tablet, and the like, and specifically includes:
step 101, acquiring text information obtained after the speech signal is processed by automatic speech recognition ASR.
Specifically, a speech signal is acquired from the environment, and then ASR processing is performed on the speech signal to convert the speech signal into text information.
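The patent does not prescribe a particular ASR engine for this step; as one hedged illustration, the open-source SpeechRecognition package for Python can capture a sound signal from the environment and convert it into text information (the backend and language code below are assumptions, not part of the embodiment):

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:        # sound signal acquired from the environment
        audio = recognizer.listen(source)

    # ASR processing: speech signal -> text information
    text_information = recognizer.recognize_google(audio, language="zh-CN")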
It should be noted that in this embodiment the voice signal is a sound signal obtained from the environment; it contains not only the user's voice instruction but possibly also other sound from the user's surroundings. For example, when a user who wants to hear a song issues the voice instruction "play song A" near the robot while other people nearby are chatting and their conversation includes "long time no see", the robot may capture from the environment an instruction mixed with that conversation, such as "play long time no see song A".
Step 102: performing feature extraction on the text information to obtain a feature vector of the text information.
In this embodiment, the NLU model is used to extract the feature vector of the text information. Further, in some embodiments the NLU model includes the LSTM and Unigram language models and the BERT language representation model, in which case step 102 is in fact: extracting features along multiple dimensions using the LSTM model, the Unigram model and the BERT model within the natural language understanding NLU model, to obtain feature vectors of the text information in those dimensions.
It is worth mentioning that multiple models (the LSTM model, the Unigram model and the BERT model) are used during feature extraction, yielding feature vectors of several dimensions rather than a single perplexity or classification result, so more information is available when judging whether the text information is meaningful, improving the accuracy and recall of the model.
In one example, the feature vector of the text information in multiple dimensions may be represented by the following expressions (most of the formulas were rendered as images in the original publication; those recoverable from the accompanying description are reproduced here):

    bert_prob = BERT(S)
    P_m(S) = P(w_n | w_1, w_2, ..., w_{n-1})
    P_u(S) = P(w_1) P(w_2) ... P(w_n)
    P_m(w) = P(w_i | w_1, w_2, ..., w_{i-1}), where w = w_i
    P_u(w) = P(w_i), where w = w_i

where S denotes the text, |S| the text length and N the vocabulary size; BERT(S) is the probability value produced by the BERT model; P_m(S) is the probability of the text S occurring under the LSTM language model, with w_n a word in S; P_u(S) is the probability of S occurring under the Unigram language model; P_m(w) is the probability of the current word w occurring in S under the LSTM language model; and P_u(w) is the probability of the current word w occurring in S under the Unigram language model.
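As a concrete illustration, the multi-dimensional feature vector above can be assembled roughly as follows. This is a minimal sketch in Python: the wrapper objects bert_model, lstm_lm and unigram_lm and the exact feature list are assumptions (the full list of image-rendered formulas in the original publication is not recoverable), and tokenization is character-level for Chinese text.

    import math

    def extract_features(text, bert_model, lstm_lm, unigram_lm):
        # Assumed interfaces, not the patent's API:
        #   bert_model(text)            -> probability that the text is meaningful
        #   lstm_lm.word_prob(w, hist)  -> P(w | history) under the LSTM LM
        #   unigram_lm.word_prob(w)     -> P(w) under the Unigram LM
        words = list(text)              # character-level tokens
        n = len(words)

        bert_prob = bert_model(text)

        # Per-word probabilities under both language models
        lstm_probs = [lstm_lm.word_prob(w, words[:i]) for i, w in enumerate(words)]
        uni_probs = [unigram_lm.word_prob(w) for w in words]

        # Sentence probabilities P_m(S) and P_u(S)
        p_m = math.prod(lstm_probs)
        p_u = math.prod(uni_probs)

        # Length-normalised, perplexity-style scores so texts of different
        # lengths remain comparable
        ppl_m = p_m ** (-1.0 / n)
        ppl_u = p_u ** (-1.0 / n)

        # Minima capture local breaks where noise text interrupts an instruction
        return [bert_prob, p_m, p_u, ppl_m, ppl_u,
                min(lstm_probs), min(uni_probs), n]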
Further, in some embodiments, the meaningless text recognition model is an extreme gradient boosting (XGBoost) model.
It is worth mentioning that whether a text is meaningful is usually judged, in natural language processing, either by the perplexity (PPL) of a language model or by a deep-learning classification model. Both methods have drawbacks: the PPL value only indicates a tendency (the larger the value, the lower the probability of the text occurring), and there is no fixed threshold above which text can be declared meaningless, while the recall of a deep-learning classifier on meaningless text is low. Using the ensemble model XGBoost to judge whether text is meaningful allows the judgment to rely on multiple features at once, such as the output of the classification model and language-model features like PPL, which improves both the accuracy and the recall of the model.
Step 103: inputting the feature vector into the trained meaningless text recognition model, and judging whether the text information is meaningless text according to the output of the model.
In this embodiment, meaningless text is text that does not conform to a conventional expression mode, i.e. text whose wording does not follow common usage.
It should be noted that, generally speaking, if the text information is a voice instruction mixed with interference such as environmental noise, then once the instruction is broken up by the interference its content no longer matches any common expression; that is, the meaningless text differs from the instruction actually issued, and any response to it would necessarily diverge from the intention contained in the instruction, i.e. the response would be wrong. The essence of judging whether the text information is meaningless text is therefore judging whether it can be responded to correctly; if it cannot, it need not be responded to at all.
Step 104: if the text information is not meaningless text, responding to the text information after the trained response judgment model detects that a response is needed.
In this embodiment, even meaningful text, i.e. text subject to little or no interference, does not necessarily require a response. For example, when the text information is "I am reading a book" or "the weather is clear", the user may not need the electronic device to respond. The ASR may also pick up the chat of people nearby, causing the robot to respond continuously; utterances such as "mom and me also go" or "lao po bai" are not instructions issued to the robot and need no response. Therefore, after the text information is judged meaningful, it is still necessary to judge whether it requires a response.
As is clear from the description of step 103, when a meaningless text is determined, a response is not required. Thus, in some embodiments, referring to fig. 2, step 103 is: inputting the feature vector into a trained meaningless text recognition model, judging whether the text information is a meaningless text according to an output result of the meaningless text recognition model, if so, executing a step 105, and if not, executing a step 106.
Step 103 is followed by the steps of:
step 105, abandoning the response to the text message.
And step 106, judging whether the text information needs to be responded or not by using the trained response judgment model, if so, executing step 107, and if not, executing step 105.
Step 107, the text message is responded.
Steps 105 to 107 amount to the following: for a piece of text information, the meaningless text recognition model is called first to judge whether the text is meaningful; if not, subsequent processing stops and no response is given; if so, the response judgment model is called to judge whether the text should be responded to. For example, for the ASR-recognized text "asking for the flight to check the airplane", the meaningless text recognition model finds it meaningful, the response judgment model is then called, the judgment is that a response is needed, and the robot gives a response. For the text "the user does not know me, the user calculates me towards the value", the meaningless text recognition model finds it meaningless, the response judgment model is not called, and the robot does not respond.
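Expressed as code, the call order of steps 103 to 107 might look like the sketch below; the model wrappers, label conventions and the generate_response placeholder are assumptions rather than the patent's API.

    def generate_response(text):
        # Placeholder for the downstream NLU intent parsing and TTS stage
        return f"response to: {text}"

    def handle_asr_text(text, featurize, meaningless_model, response_model):
        features = featurize(text)
        # Step 103: meaningless text recognition (assume label 1 = meaningless)
        if meaningless_model.predict([features])[0] == 1:
            return None                          # step 105: give up responding
        # Step 106: response judgment (assume a fastText-style predict())
        labels, _ = response_model.predict(text)
        if labels[0] == "__label__no_respond":
            return None                          # step 105: give up responding
        return generate_response(text)           # step 107: respond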
It should be noted that steps 106 and 107 together are equivalent to step 104; only one specific implementation is given here, and the steps may be split or combined in other ways, which are not described again.
The above embodiments illustrate how speech interaction is performed using models, and the following embodiments will illustrate how models are trained.
In some embodiments, step 103 is preceded by step 108: training the meaningless text recognition model. Referring to fig. 3, step 108 specifically includes the following steps:
Step 1081: constructing an initial training set containing both meaningless text and meaningful text.
In this embodiment, the initial training set may be an existing open source data set, or may be any corpus containing meaningless text and meaningful text. The embodiment does not limit the number of texts in the initial training set and the size of the data set, and does not limit the proportion of meaningful texts and meaningless texts in the data set.
It should be noted that the texts in the initial training set are not plain texts but texts labeled as meaningful or meaningless; for example, the text "please turn on the light" is labeled meaningful, while the text "answer the warm of my score" is labeled meaningless, and so on.
Step 1082: performing feature extraction on the meaningless text and the meaningful text in the initial training set, and using the obtained feature vectors as the recognition training set.
Step 1083: training the meaningless text recognition model on the recognition training set to obtain the trained meaningless text recognition model.
It should be noted that this embodiment does not limit the training procedure, e.g. the training objective; how to train may be determined according to the actual situation.
It should also be noted that in this embodiment the extracted features of the meaningless and meaningful texts, rather than the texts themselves, are used for training, so the training samples better reflect the meaning-related characteristics of the text without over-weighting its other characteristics; the model therefore recognizes meaningfulness more accurately, improving the recognition effect of the meaningless text recognition model.
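A hedged sketch of steps 1082 and 1083 with the XGBoost classifier adopted in a later embodiment: texts, labels and featurize are assumed to come from the labelled initial training set and a feature extractor such as the one sketched earlier, and the hyperparameters are illustrative only.

    import numpy as np
    import xgboost as xgb

    def train_meaningless_model(texts, labels, featurize):
        # Step 1082: turn the labelled texts into the recognition training set
        X = np.array([featurize(t) for t in texts])
        y = np.array(labels)  # assumed convention: 1 = meaningless, 0 = meaningful

        # Step 1083: train the meaningless text recognition model
        clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
        clf.fit(X, y)
        return clf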
It should further be noted that the feature extraction of step 1082 may be implemented by extracting features along multiple dimensions using the LSTM model, the Unigram model and the BERT model within the natural language understanding NLU model. Further, in some embodiments, step 108 is preceded by step 109: training the NLU model, where the NLU model includes the LSTM model, the Unigram model and the BERT model. Referring to fig. 4, step 109 specifically includes the following steps:
Step 1091: constructing a BERT training set containing both meaningless text and meaningful text.
In this embodiment, the meaning of the BERT training set is substantially the same as the meaning of the initial training set in step 1081, which is not described herein again.
Step 1092, train the BERT model using the BERT training set.
The present embodiment actually optimizes the BERT model using the BERT training set.
It is worth mentioning that training the BERT model on a set composed of meaningful and meaningless texts enables the BERT model to perceive whether text is meaningful, sharpening its output along the meaningful/meaningless dimension.
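One way to realise steps 1091 and 1092 is to fine-tune a pre-trained Chinese BERT as a two-class classifier with the Hugging Face transformers library; the checkpoint name, hyperparameters and label convention below are assumptions, not the embodiment's prescription.

    import torch
    from transformers import (BertForSequenceClassification, BertTokenizer,
                              Trainer, TrainingArguments)

    def finetune_bert(texts, labels):    # labels: 0 = meaningful, 1 = meaningless
        tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
        model = BertForSequenceClassification.from_pretrained(
            "bert-base-chinese", num_labels=2)

        class MeaningDataset(torch.utils.data.Dataset):
            def __init__(self, texts, labels):
                self.enc = tokenizer(texts, truncation=True, padding=True,
                                     max_length=64)
                self.labels = labels
            def __len__(self):
                return len(self.labels)
            def __getitem__(self, i):
                item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
                item["labels"] = torch.tensor(self.labels[i])
                return item

        args = TrainingArguments(output_dir="bert-meaning", num_train_epochs=3,
                                 per_device_train_batch_size=16)
        Trainer(model=model, args=args,
                train_dataset=MeaningDataset(texts, labels)).train()
        return model, tokenizer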
Step 1093, train the Unigram model and LSTM model with the open source dataset.
In this embodiment, the open-source dataset may be, for example, Wikipedia, novels, news and the like; these are merely specific illustrations, and other types of open-source datasets may also be used, which are not described again here.
It is worth mentioning that because an open-source dataset supplies a large amount of training data, unsupervised learning of the language models can be carried out well, avoiding training that fails to meet requirements such as precision and accuracy.
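On the Unigram side, unsupervised training reduces to frequency counting over the open-source corpus. A minimal character-level sketch with add-alpha smoothing follows; the smoothing scheme is an assumption, not specified by the embodiment.

    from collections import Counter

    def train_unigram(corpus_lines, alpha=1.0):
        counts = Counter()
        for line in corpus_lines:
            counts.update(line.strip())   # character-level tokens
        total = sum(counts.values())
        vocab = len(counts)

        def word_prob(w):
            # Add-alpha smoothing keeps unseen characters at non-zero probability
            return (counts[w] + alpha) / (total + alpha * vocab)
        return word_prob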
When constructing a dataset containing both meaningless and meaningful text, such as the initial training set of step 1081 and the BERT training set of step 1091, a large amount of meaningless text is needed to improve the training effect, but a large amount of meaningless text would normally mean a large amount of manual labeling. Therefore, in some embodiments, referring to fig. 5, meaningless text is obtained as follows:
step 501, obtaining texts which do not conform to the conventional expression mode and texts which conform to the conventional expression mode from the corpus.
Step 502, randomly applying an adjusting operation to the texts that conform to the conventional expression mode, the adjusting operation comprising one or a combination of the following: shuffling, cutting and splicing.
In this embodiment, a text that conforms to the conventional expression mode, i.e. a meaningful text, is adjusted: for example, a cut text is spliced with another text or part of another text, the character order of a text is shuffled, or two texts are spliced together. The adjusted meaningful text thereby simulates the text produced when an interfered voice instruction is converted in a real scene.
In one example, the characters of the normal meaningful text "you are really beautiful" are randomly shuffled into "beautiful you really are"; the same text is randomly cut into "you are really"; and the two meaningful sentences "you are really beautiful" and "how to ask about how to handle the machine" are cut and spliced into "you are really how to handle", and so on.
Step 503, using the adjusted texts conforming to the conventional expression mode and the texts not conforming to the conventional expression mode as meaningless texts in the initial training set and the BERT training set.
It is worth mentioning that ambient noise makes meaningless texts highly varied, and finding, constructing and labeling training data purely by hand would consume a great deal of labor. Steps 501 and 502 generate new meaningless texts directly by cutting and randomly recombining meaningful texts, without manual judgment and labeling, enlarging the dataset without increasing the consumption of human resources. The adjusting operations can be implemented in a few lines, as shown in the sketch below.
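The sketch automates the generation described in steps 501 to 503; the uniform choice of operation and the fixed splice points are assumptions.

    import random

    def make_meaningless(text, other_texts):
        op = random.choice(["shuffle", "cut", "splice"])
        if op == "shuffle":               # disorder the character sequence
            chars = list(text)
            random.shuffle(chars)
            return "".join(chars)
        if op == "cut":                   # truncate at a random point
            return text[:random.randint(1, max(1, len(text) - 1))]
        # splice: join a prefix of this text to a suffix of another text
        other = random.choice(other_texts)
        return text[:len(text) // 2] + other[len(other) // 2:]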
In addition, considering that the BERT model in effect encodes and decodes its input, its output resembles the input; and since the output of the BERT model is among the features fed to the meaningless text recognition model, training the BERT model and the meaningless text recognition model on the same dataset would give their training data overlapping content, causing the trained meaningless text recognition model to overfit.
Further, in some embodiments, to make the intersection of the BERT training set and the initial training set empty, a larger dataset D containing meaningless and meaningful text may be obtained first and then split into two datasets D1 and D2, used respectively as the BERT training set and the initial training set. D1 and D2 may be equal or unequal in size, and the numbers and proportions of meaningful and meaningless texts in them may likewise be the same or different, which need not be detailed here.
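Such a disjoint split can be made with an off-the-shelf splitter; in the sketch below texts and labels are assumed parallel lists forming the pool D, and the 50/50 ratio is just one of the choices the embodiment explicitly leaves open.

    from sklearn.model_selection import train_test_split

    # D1 -> BERT training set, D2 -> initial training set; disjoint by construction
    d1_texts, d2_texts, d1_labels, d2_labels = train_test_split(
        texts, labels, test_size=0.5, stratify=labels, random_state=42)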
In addition, the voice interaction method provided by the invention also involves the response judgment model. In some embodiments the response judgment model is a FastText model; before the trained model is used in step 104, the FastText model is trained on a pre-constructed response dataset to obtain the trained FastText model, wherein the response dataset contains texts that need a response and texts that do not, i.e. each text is labeled as requiring a response or not.
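Training the response judgment model with the fastText library might look as follows; the file name and label strings are assumptions (fastText's supervised mode expects one "__label__<tag> <text>" sample per line).

    import fasttext

    # response_dataset.txt (assumed name), one labelled sample per line, e.g.:
    #   __label__respond where can I check in for my flight
    #   __label__no_respond mom and me also need to go
    model = fasttext.train_supervised(input="response_dataset.txt",
                                      epoch=10, lr=0.5)

    labels, probs = model.predict("where can I check in for my flight")
    # -> e.g. (('__label__respond',), array([0.97]))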
To better illustrate the effect of the present invention, the following compares experimental results of the voice interaction method provided by the invention with those of the traditional approach of improving ASR accuracy:
text obtained by ASR processing Conventional methods The invention
Asking for a request for a local value Answering Answering
The no-event me is to calculate me toward the value Answering Do not respond to
Mom and me also need to go Answering Do not respond to
Ask me Answering Do not respond to
As the table shows, when the interference is too large the conventional method cannot avoid responding to the wrong text; for example, for the ASR-processed text "the no-event me is to calculate me toward the value" the conventional method still responds, whereas the invention does not.
The steps of the above methods are divided only for clarity of description; in implementation they may be combined into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as the same logical relationship is preserved. Adding insignificant modifications to the algorithm or flow, or introducing insignificant design changes, without altering the core design of the algorithm and flow, is likewise within the protection scope of the patent.
An embodiment of the present invention further provides a voice interaction system, as shown in fig. 6, including:
the obtaining module 601 is configured to obtain text information obtained by processing a speech signal through automatic speech recognition ASR, where the speech signal is a sound signal obtained from an environment.
The feature extraction module 602 is configured to perform feature extraction on the text information to obtain a feature vector of the text information.
And the meaning judgment module 603 is configured to input the feature vector into the trained meaningless text recognition model, and judge whether the text information is a meaningless text according to an output result of the meaningless text recognition model, where the meaningless text is a text that does not conform to a conventional expression manner.
The response module 604 is configured to, if the text information is not meaningless text, respond to the text information after detecting, using the trained response judgment model, that a response is needed.
It should be understood that the present embodiment is a system embodiment corresponding to the above method embodiments, and the present embodiment can be implemented in cooperation with the method embodiments. The related technical details mentioned in the method embodiment are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related art details mentioned in the present embodiment can also be applied in the method embodiments.
It should be noted that the modules involved in this embodiment are logical modules; in practical applications a logical unit may be one physical unit, part of a physical unit, or a combination of several physical units. In addition, to highlight the innovative part of the invention, units less closely related to solving the technical problem posed by the invention are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
An embodiment of the present invention further provides an electronic device, as shown in fig. 7, including:
at least one processor 701; and
a memory 702 communicatively coupled to the at least one processor 701; wherein
the memory 702 stores instructions executable by the at least one processor 701 to enable the at least one processor 701 to perform the voice interaction method according to the embodiments of the present invention.
The memory and the processor are connected by a bus, which may include any number of interconnected buses and bridges linking one or more processors and the memory together. The bus may also link various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further here. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or several, e.g. multiple receivers and transmitters, providing a unit for communicating with various other apparatus over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna, which also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management and other control functions, while the memory may be used to store data used by the processor in performing operations.
Those skilled in the art will understand that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program is stored in a storage medium and includes several instructions for causing a device (which may be a microcontroller, a chip or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
Those of ordinary skill in the art will understand that the above embodiments are specific examples of carrying out the invention, and that in practice various changes in form and detail may be made without departing from its spirit and scope.

Claims (10)

1. A method of voice interaction, comprising:
acquiring text information obtained after a voice signal is subjected to automatic speech recognition ASR processing, wherein the voice signal is a sound signal acquired from an environment;
extracting the features of the text information to obtain a feature vector of the text information;
inputting the feature vector into a trained meaningless text recognition model, and judging whether the text information is a meaningless text according to an output result of the meaningless text recognition model, wherein the meaningless text is a text which does not conform to a conventional expression mode;
and if the text information is not the meaningless text, responding to the text information after detecting, by using the trained response judgment model, that the text information needs to be responded to.
2. The voice interaction method according to claim 1, wherein before the inputting of the feature vector into the trained meaningless text recognition model, the method further comprises:
constructing an initial training set containing the meaningless text and the meaningful text simultaneously;
extracting features of the meaningless text and the meaningful text contained in the initial training set, and taking the obtained feature vectors as an identification training set;
and training the meaningless text recognition model by using the recognition training set to obtain the trained meaningless text recognition model.
3. The voice interaction method of claim 2, wherein the performing feature extraction comprises:
respectively extracting features from a plurality of dimensions by using an LSTM model, a Unigram model and a BERT model in a natural language understanding NLU model;
before the feature extraction, the method further comprises:
constructing a BERT training set simultaneously containing the meaningless text and the meaningful text;
training the BERT model by using the BERT training set;
the Unigram model and the LSTM model are trained using an open source dataset.
4. The method of claim 3, wherein the obtaining of the meaningless text in the initial training set and the BERT training set comprises:
acquiring texts which do not conform to the conventional expression mode and texts which conform to the conventional expression mode from a corpus;
randomly performing an adjusting operation on the text conforming to the conventional expression mode, wherein the adjusting operation comprises one or a combination of the following operations: disorder processing, cutting processing and splicing processing;
and taking the adjusted texts conforming to the conventional expression mode and the texts not conforming to the conventional expression mode as the meaningless texts in the initial training set and the BERT training set.
5. The method of claim 3, wherein the intersection of the BERT training set and the initial training set is an empty set.
6. The voice interaction method according to any one of claims 1 to 5, wherein the meaningless text recognition model is an extreme gradient boosting XGBoost model.
7. The voice interaction method according to claim 1, wherein the response determination model is a FastText model, and before the trained response determination model detects that the text information needs to be responded to, the method further comprises:
and training the FastText model by utilizing a pre-constructed response data set to obtain the trained FastText model, wherein the response data set comprises texts needing to be responded and texts not needing to be responded.
8. A voice interaction system, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring text information obtained by processing a voice signal through Automatic Speech Recognition (ASR), and the voice signal is a sound signal acquired from the environment;
the characteristic extraction module is used for extracting the characteristics of the text information to obtain a characteristic vector of the text information;
the meaning judgment module is used for inputting the feature vectors into a trained meaningless text recognition model and judging whether the text information is a meaningless text according to an output result of the meaningless text recognition model, wherein the meaningless text is a text which does not conform to a conventional expression mode;
and the response module is used for, if the text information is not the meaningless text, responding to the text information after detecting, by using the trained response judgment model, that the text information needs to be responded to.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of voice interaction of any of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method for voice interaction according to any one of claims 1 to 7.
CN202110707954.6A 2021-06-24 2021-06-24 Voice interaction method, system, electronic equipment and storage medium Pending CN113362815A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110707954.6A CN113362815A (en) 2021-06-24 2021-06-24 Voice interaction method, system, electronic equipment and storage medium
PCT/CN2021/140759 WO2022267405A1 (en) 2021-06-24 2021-12-23 Speech interaction method and system, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707954.6A CN113362815A (en) 2021-06-24 2021-06-24 Voice interaction method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113362815A (en) 2021-09-07

Family

ID=77536301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707954.6A Pending CN113362815A (en) 2021-06-24 2021-06-24 Voice interaction method, system, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113362815A (en)
WO (1) WO2022267405A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283794A (en) * 2021-12-14 2022-04-05 达闼科技(北京)有限公司 Noise filtering method, noise filtering device, electronic equipment and computer readable storage medium
WO2022267405A1 (en) * 2021-06-24 2022-12-29 达闼机器人股份有限公司 Speech interaction method and system, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003108581A (en) * 2001-09-27 2003-04-11 Mitsubishi Electric Corp Interactive information retrieving device and interactive information retrieving method
CN110674276A (en) * 2019-09-23 2020-01-10 深圳前海微众银行股份有限公司 Robot self-learning method, robot terminal, device and readable storage medium
CN111554293A (en) * 2020-03-17 2020-08-18 深圳市奥拓电子股份有限公司 Method, device and medium for filtering noise in voice recognition and conversation robot
CN111966706A (en) * 2020-08-19 2020-11-20 中国银行股份有限公司 Official micro-response method and device
CN112735465A (en) * 2020-12-24 2021-04-30 广州方硅信息技术有限公司 Invalid information determination method and device, computer equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9378755B2 (en) * 2014-05-30 2016-06-28 Apple Inc. Detecting a user's voice activity using dynamic probabilistic models of speech features
US10489393B1 (en) * 2016-03-30 2019-11-26 Amazon Technologies, Inc. Quasi-semantic question answering
CN107665708B (en) * 2016-07-29 2021-06-08 科大讯飞股份有限公司 Intelligent voice interaction method and system
CN111816172A (en) * 2019-04-10 2020-10-23 阿里巴巴集团控股有限公司 Voice response method and device
CN112614514B (en) * 2020-12-15 2024-02-13 中国科学技术大学 Effective voice fragment detection method, related equipment and readable storage medium
CN113362815A (en) * 2021-06-24 2021-09-07 达闼机器人有限公司 Voice interaction method, system, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003108581A (en) * 2001-09-27 2003-04-11 Mitsubishi Electric Corp Interactive information retrieving device and interactive information retrieving method
CN110674276A (en) * 2019-09-23 2020-01-10 深圳前海微众银行股份有限公司 Robot self-learning method, robot terminal, device and readable storage medium
CN111554293A (en) * 2020-03-17 2020-08-18 深圳市奥拓电子股份有限公司 Method, device and medium for filtering noise in voice recognition and conversation robot
CN111966706A (en) * 2020-08-19 2020-11-20 中国银行股份有限公司 Official micro-response method and device
CN112735465A (en) * 2020-12-24 2021-04-30 广州方硅信息技术有限公司 Invalid information determination method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022267405A1 (en) * 2021-06-24 2022-12-29 达闼机器人股份有限公司 Speech interaction method and system, electronic device, and storage medium
CN114283794A (en) * 2021-12-14 2022-04-05 达闼科技(北京)有限公司 Noise filtering method, noise filtering device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
WO2022267405A1 (en) 2022-12-29

Similar Documents

Publication Publication Date Title
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
CN109003624B (en) Emotion recognition method and device, computer equipment and storage medium
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111209740B (en) Text model training method, text error correction method, electronic device and storage medium
CN112487139B (en) Text-based automatic question setting method and device and computer equipment
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN111753545A (en) Nested entity recognition method and device, electronic equipment and storage medium
CN111062217A (en) Language information processing method and device, storage medium and electronic equipment
CN108228574B (en) Text translation processing method and device
CN109616096A (en) Construction method, device, server and the medium of multilingual tone decoding figure
CN111079418B (en) Named entity recognition method, device, electronic equipment and storage medium
CN113362815A (en) Voice interaction method, system, electronic equipment and storage medium
CN114757176A (en) Method for obtaining target intention recognition model and intention recognition method
CN112016271A (en) Language style conversion model training method, text processing method and device
CN111881297A (en) Method and device for correcting voice recognition text
KR20210059995A (en) Method for Evaluating Foreign Language Speaking Based on Deep Learning and System Therefor
US10614170B2 (en) Method of translating speech signal and electronic device employing the same
CN116821290A (en) Multitasking dialogue-oriented large language model training method and interaction method
CN113935331A (en) Abnormal semantic truncation detection method, device, equipment and medium
CN111968646A (en) Voice recognition method and device
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN111625636B (en) Method, device, equipment and medium for rejecting man-machine conversation
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN113889115A (en) Dialect commentary method based on voice model and related device
KR102107447B1 (en) Text to speech conversion apparatus for providing a translation function based on application of an optional speech model and operating method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 200245 Building 8, No. 207, Zhongqing Road, Minhang District, Shanghai

Applicant after: Dayu robot Co.,Ltd.

Address before: 200245 2nd floor, building 2, no.1508, Kunyang Road, Minhang District, Shanghai

Applicant before: Dalu Robot Co.,Ltd.

CB02 Change of applicant information