WO2022267405A1 - Voice interaction method, system, electronic device, and storage medium - Google Patents

Voice interaction method, system, electronic device, and storage medium

Info

Publication number
WO2022267405A1
WO2022267405A1 (PCT/CN2021/140759)
Authority
WO
WIPO (PCT)
Prior art keywords
text
model
text information
nonsense
meaningless
Prior art date
Application number
PCT/CN2021/140759
Other languages
English (en)
French (fr)
Inventor
李翠姣
Original Assignee
达闼机器人股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼机器人股份有限公司
Publication of WO2022267405A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/225 Feedback of the input speech

Definitions

  • The embodiments of the present application relate to the technical field of voice interaction, and in particular to a voice interaction method, system, electronic device, and storage medium.
  • Voice interaction usually means that an electronic device such as a robot acquires a voice signal from the environment, uses automatic speech recognition (ASR) to process the voice signal into text information, then performs natural language understanding (NLU) processing on the text information to obtain the intent it contains and determine the response content corresponding to that intent, then converts the response content from text to speech through text-to-speech (TTS) processing, and finally outputs the speech to complete the voice interaction. Since both the NLU processing and the TTS processing are based on the text information obtained by ASR processing, the quality of the ASR result directly affects the response effect of the voice interaction.
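The ASR, NLU, and TTS stages described above can be sketched as a simple pipeline. This is an illustrative stand-in, not the patent's implementation: the `asr`, `nlu`, and `tts` bodies below are placeholders for real engines.

```python
def asr(voice_signal: bytes) -> str:
    """Stand-in ASR: pretend the audio decodes to a fixed utterance."""
    return "play song A"

def nlu(text: str) -> str:
    """Stand-in NLU: map recognized text to a response."""
    if "play" in text:
        return "Now playing song A."
    return "Sorry, I did not understand."

def tts(response_text: str) -> bytes:
    """Stand-in TTS: encode the response text as audio bytes."""
    return response_text.encode("utf-8")

def voice_interaction(voice_signal: bytes) -> bytes:
    text = asr(voice_signal)   # speech -> text
    response = nlu(text)       # text -> intent -> response text
    return tts(response)       # response text -> speech
```

Because every later stage consumes the ASR output, noise that corrupts `text` propagates through the whole chain, which is the problem this patent addresses.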
  • ASR: Automatic Speech Recognition
  • NLU: Natural Language Understanding
  • In practical application scenarios, the environment in which voice interaction takes place is usually noisy, and interference such as noise is unavoidable, especially in public environments such as airports and hospitals, where the sound is noisier and the interference greater.
  • In a noisy environment, the acquired voice signal will include a lot of background noise, such as the conversations of people nearby and environmental noise. When the voice signal is converted into text information through ASR processing, the background noise is converted into text at the same time, so the ASR result is poor and the electronic device answers randomly and continuously.
  • One feasible solution is to continuously improve the accuracy and precision of ASR processing so as to reduce the noise input.
  • An embodiment of the present application provides a voice interaction method, comprising the following steps: acquiring text information obtained after a voice signal is processed by automatic speech recognition (ASR), where the voice signal is a sound signal acquired from the environment; performing feature extraction on the text information to obtain a feature vector of the text information; inputting the feature vector into a trained nonsense text recognition model, and judging whether the text information is nonsense text according to the output result of the nonsense text recognition model, where nonsense text is text that does not conform to conventional expressions; and, if the text information is not nonsense text, responding to the text information after a trained response judgment model detects that the text information needs a response.
  • Embodiments of the present application also provide a voice interaction system, including: an acquisition module, configured to acquire text information obtained after a voice signal is processed by automatic speech recognition (ASR), where the voice signal is a sound signal acquired from the environment;
  • a feature extraction module, configured to perform feature extraction on the text information to obtain a feature vector of the text information;
  • a meaning judgment module, configured to input the feature vector into a trained nonsense text recognition model and judge whether the text information is nonsense text according to the output result of the nonsense text recognition model, where nonsense text is text that does not conform to conventional expressions;
  • a response module, configured to respond to the text information if the text information is not nonsense text, after a trained response judgment model detects that the text information needs a response.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the voice interaction method described above.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program, which implements the above voice interaction method when executed by a processor.
  • An embodiment of the present application also provides a computer program, which implements the voice interaction method described above when executed by a processor.
  • Fig. 1 is a flowchart of the voice interaction method in an embodiment of the present application;
  • Fig. 2 is a flowchart of the voice interaction method including the step of giving up the answer in another embodiment of the present application;
  • Fig. 3 is a flowchart of the voice interaction method including the step of constructing an initial training set in another embodiment of the present application;
  • Fig. 4 is a flowchart of the voice interaction method including the step of constructing a BERT training set in another embodiment of the present application;
  • Fig. 5 is a flowchart of the step of obtaining meaningless text when constructing the BERT data set and the initial data set in the voice interaction method of an embodiment of the present application;
  • Fig. 6 is a schematic structural diagram of the voice interaction system in another embodiment of the present application;
  • Fig. 7 is a schematic structural diagram of an electronic device in another embodiment of the present application.
  • The purpose of the embodiments of the present application is to provide a voice interaction method, system, electronic device, and storage medium, so that electronic devices can avoid answering randomly and continuously without needing to improve ASR processing precision and accuracy, thereby improving the response effect in noisy environments.
  • To that end, the embodiment of the present application provides a voice interaction method, including: acquiring text information obtained after a voice signal is processed by automatic speech recognition (ASR), where the voice signal is a sound signal obtained from the environment; performing feature extraction on the text information to obtain a feature vector of the text information; inputting the feature vector into a trained nonsense text recognition model and judging whether the text information is nonsense text according to the output result of the model, where nonsense text is text that does not conform to conventional expressions; and, if the text information is not nonsense text, responding to the text information after a trained response judgment model detects that the text information needs a response.
  • Unlike conventional voice interaction, the natural language understanding (NLU) stage does not directly extract the intent and respond; instead, feature extraction is performed on the text information first, so that the feature vector obtained through feature extraction can be used as the input of the trained nonsense text recognition model. Whether the text information is nonsense text is judged according to the model's output, and only when the text information is judged to be meaningful does the method continue to judge whether a response is required, responding only when a response is needed.
  • This avoids response errors caused by noise-derived text mixed into the text information, as well as continuous responses triggered when noise and other ambient sounds are captured as voice input; that is, it avoids answering wrongly and answering noise, thereby improving the response effect of voice interaction in noisy environments.
  • In some embodiments, the voice interaction method is applied to electronic devices capable of voice interaction, such as robots and tablets, and specifically includes:
  • Step 101: acquire text information obtained after the voice signal is processed by automatic speech recognition (ASR).
  • Specifically, the voice signal is acquired from the environment, and ASR processing converts it into text information.
  • The voice signal in this embodiment is a sound signal obtained from the environment, so it not only includes the user's voice command but may also include other speech in the user's surroundings. For example, a user issues the voice command "play song A" near the robot while other users nearby are chatting and their conversation includes "long time no see"; the robot may also capture the other users' speech from the environment, so that the recognized command becomes something like "play long time no see song A".
  • Step 102: perform feature extraction on the text information to obtain a feature vector of the text information.
  • In some embodiments, the NLU model is used to extract the feature vector of the text information.
  • The NLU model here includes the LSTM model and the Unigram model among language models, and the BERT model among language representation models.
  • That is, step 102 is in effect: use the LSTM model, Unigram model, and BERT model in the natural language understanding (NLU) model to perform feature extraction from multiple dimensions, obtaining feature vectors of the text information in multiple dimensions.
  • Using multiple models (the LSTM, Unigram, and BERT models) for feature extraction yields feature vectors of multiple dimensions rather than a single perplexity value or classification result, so the judgment of whether the text information is meaningful can draw on more information, which can improve the precision and recall of the model.
  • The feature vectors of the text information in multiple dimensions can be represented by the following quantities:
  • S denotes the text, and len(S) denotes the length of the text;
  • N denotes the number of words in S;
  • ⌈x⌉ denotes rounding x up to an integer;
  • BERT(S) denotes the probability value obtained through the BERT model;
  • P_lm(S) = ∏_{n=1..N} P(w_n | w_1, w_2, ..., w_{n-1}) denotes the probability of the text S under the LSTM language model, where w_n is the n-th word in S;
  • P_uni(S) = ∏_{i=1..N} P(w_i) denotes the probability of the text S under the Unigram model.
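As an illustration of the multi-dimensional feature vector, the sketch below computes the length, word-count, and Unigram log-probability features over a toy corpus; the BERT and LSTM scores are stubbed placeholder callables, since in a real system they would come from pretrained models.

```python
import math
from collections import Counter

# Toy corpus for the Unigram model; in the patent, the Unigram and LSTM
# models are trained on large open-source data sets.
CORPUS = "please play song a please turn on the light long time no see".split()
UNIGRAM = Counter(CORPUS)
TOTAL = sum(UNIGRAM.values())

def unigram_logprob(words):
    # log P_uni(S) = sum_i log P(w_i), with Laplace smoothing for unseen words
    vocab = len(UNIGRAM)
    return sum(math.log((UNIGRAM[w] + 1) / (TOTAL + vocab)) for w in words)

def extract_features(text,
                     bert_score=lambda s: 0.5,        # stub for BERT(S)
                     lstm_logprob=lambda ws: -10.0):  # stub for log P_lm(S)
    words = text.split()
    return {
        "length": len(text),                  # len(S)
        "num_words": len(words),              # N
        "bert": bert_score(text),             # BERT(S)
        "lstm_logprob": lstm_logprob(words),  # log P_lm(S)
        "unigram_logprob": unigram_logprob(words),
    }
```

The resulting dictionary is the multi-dimensional feature vector that step 103 feeds into the nonsense text recognition model.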
  • In some embodiments, the nonsense text recognition model is an extreme gradient boosting (XGBoost) model.
  • In natural language processing, whether a text is meaningful is usually judged by the perplexity (PPL) of a language model or by a deep learning classification model.
  • However, the PPL value only represents a trend: the larger the value, the smaller the probability of the text appearing, but there is no definite threshold above which a text is certainly meaningless.
  • Deep learning classification models, for their part, have low recall when judging meaningless text. Therefore, using the ensemble model XGBoost to judge whether the text is meaningful allows the judgment to rely on multiple features, such as the result of the classification model and the PPL of the language model, which can improve the precision and recall of the model.
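The value of combining features can be illustrated with a dependency-free stand-in for XGBoost: several per-feature decision stumps vote, so no single PPL threshold has to be decisive. The thresholds below are invented for illustration, not learned values.

```python
def is_nonsense(features: dict) -> bool:
    """Majority vote over three hand-set decision stumps (illustrative
    stand-in for a trained XGBoost ensemble)."""
    votes = 0
    if features["ppl"] > 200:       # high perplexity -> suspicious
        votes += 1
    if features["bert"] < 0.3:      # low BERT "meaningful" score
        votes += 1
    if features["num_words"] < 2:   # one-word fragments are often noise
        votes += 1
    return votes >= 2               # majority of the three stumps
```

A borderline PPL alone never decides the outcome here; at least one other feature must agree, which is the intuition behind the patent's multi-feature ensemble.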
  • Step 103: input the feature vector into the trained nonsense text recognition model, and judge whether the text information is nonsense text according to the output result of the model.
  • Nonsense text is text that does not conform to conventional expressions, i.e., text whose wording is not a common way of expressing things.
  • When the text information is a voice command mixed with interference such as environmental noise, its content is interrupted by the interference and cannot be matched to common expressions. In other words, the nonsense text differs from the actual command; if it were answered, the answer would necessarily differ from the intent contained in the command, i.e., the answer would be wrong. Therefore, judging whether text is nonsense is in essence judging whether the text information can be answered correctly; if it cannot, there is no need to answer.
  • Step 104: if the text information is not nonsense text, respond to the text information after the trained response judgment model detects that the text information needs a response.
  • Even meaningful text information does not necessarily need a response. For example, if the text information is "I am reading a book" or "the weather is fine", the user may not need the electronic device to respond. Moreover, ASR may pick up the chatter of people nearby, causing the robot to keep responding; for example, "Mom, I am going too" or "Goodbye, wife" is not an instruction to the robot and need not be answered. Therefore, after the text information is determined to be meaningful, it is also necessary to determine whether to respond to it.
  • In some embodiments, step 103 is: input the feature vector into the trained nonsense text recognition model and judge whether the text information is nonsense text according to the model's output; if so, execute step 105; if not, execute step 106.
  • After step 103, the following steps are also included:
  • Step 105: give up answering the text information.
  • Step 106: use the trained response judgment model to judge whether the text information needs a response; if yes, execute step 107; if not, execute step 105.
  • Step 107: respond to the text information.
  • Steps 105 to 107 amount to the following: for the text information, first call the nonsense text recognition model to judge whether it is meaningful; if not, stop subsequent processing and do not answer; if it is meaningful, call the response judgment model to decide whether to answer. For example, if ASR recognizes the text "Where is the check-in for the flight?", the nonsense text recognition model finds it meaningful, the response judgment model is then called, it determines that a response is required, and the robot gives the response result. For a noise-corrupted text such as "It's okay, I'm counting on the check-in machine", the nonsense text recognition model finds it meaningless, the response judgment model is not invoked, and the robot does not answer.
  • Steps 106 and 107 together are equivalent to step 104; this is only one specific implementation, and the steps may be split or combined in other ways, which will not be repeated here.
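The two-stage decision of steps 103 and 105 to 107 can be sketched as a short dispatcher; both models are passed in as plain callables here rather than trained models, and the feature-extraction step is elided.

```python
from typing import Callable, Optional

def decide(text: str,
           is_nonsense: Callable[[str], bool],
           needs_response: Callable[[str], bool]) -> Optional[str]:
    """Return the response text, or None when the robot gives up (step 105)."""
    if is_nonsense(text):          # step 103: nonsense -> step 105, stop early
        return None
    if not needs_response(text):   # step 106: meaningful, but no reply needed
        return None
    return "answering: " + text    # step 107: respond
```

Note that the response judgment model is never invoked for nonsense text, matching the early-exit behavior the section describes.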
  • In some embodiments, a step 108 is also included: training the nonsense text recognition model.
  • Step 108 specifically includes the following steps:
  • Step 1081: construct an initial training set containing both nonsense text and meaningful text.
  • The initial training set may be an existing open-source data set, or any corpus containing both meaningless and meaningful text.
  • This embodiment does not limit the number of texts in the initial training set or the size of the data set, nor the ratio of meaningful to meaningless texts within it.
  • The texts in the initial data set are not plain texts but texts labeled as meaningful or not.
  • For example, the text "Please turn on the light" is labeled meaningful,
  • while the text "Answer me the warmth of the score" is labeled meaningless, and so on.
  • Step 1082: perform feature extraction on the meaningless and meaningful texts in the initial training set, and use the obtained feature vectors as the recognition training set.
  • Step 1083: use the recognition training set to train the nonsense text recognition model, obtaining a trained nonsense text recognition model.
  • The training method is not limited in this embodiment; for example, the training target and training procedure can be determined according to the actual situation.
  • Note that the features extracted from meaningless and meaningful texts are used for training, rather than the texts themselves, so that the training samples better reflect the features that distinguish meaningful from meaningless text without referring too much to other features of the text, making the model more accurate at identifying meaningfulness and improving the recognition effect of the nonsense text recognition model.
  • In some embodiments, step 1082 can be implemented by using the LSTM model, Unigram model, and BERT model in the natural language understanding (NLU) model to perform feature extraction from multiple dimensions.
  • In some embodiments, a step 109 of training the NLU model is also included, where the NLU model includes the LSTM model, the Unigram model, and the BERT model. Referring to Fig. 4, step 109 specifically comprises the following steps:
  • Step 1091: construct a BERT training set containing both meaningless and meaningful text.
  • The composition of the BERT training set is roughly the same as that of the initial training set in step 1081 and will not be repeated here.
  • Step 1092: use the BERT training set to train the BERT model.
  • That is, the BERT model is further optimized using the BERT training set.
  • Training the BERT model on a set composed of meaningful and meaningless text enables it to perceive whether text is meaningful, improving the sensitivity of the BERT model's output in the meaningful/meaningless dimension.
  • Step 1093: use an open-source data set to train the Unigram model and the LSTM model.
  • The open-source data set can be, for example, Wikipedia, novels, news, and so on; these are only specific examples, and other types of open-source data sets can also be used, which will not be repeated here.
  • In the process of constructing a data set containing both meaningless and meaningful text, such as the initial training set in step 1081 and the BERT training set in step 1091, a large amount of meaningless text is needed to improve the training effect, but a large amount of meaningless text would normally mean a large amount of manual labeling work. Therefore, in some embodiments, referring to Fig. 5, ways to obtain meaningless text include:
  • Step 501: acquire texts that do not conform to conventional expressions and texts that conform to conventional expressions from the corpus.
  • Step 502: perform random adjustment operations on the texts conforming to conventional expressions, where the adjustment operations include one or a combination of the following: shuffling, cutting, and splicing.
  • That is, the texts that conform to conventional expressions (the meaningful texts) are adjusted, for example by splicing a cut-off part with other texts or with cut-off parts of other texts, scrambling the character order of a text, or splicing two texts together, so that the adjusted meaningful texts simulate the converted text of noise-corrupted voice commands in real scenarios.
  • For example, a normal meaningful text such as "You look so beautiful" can be randomly scrambled into "beautiful so You look", randomly cut into "You look so", or cut and spliced with another meaningful text such as "I want to ask how to check in" into "You look really check in", and so on.
  • Step 503: use the adjusted texts conforming to conventional expressions, together with the texts not conforming to conventional expressions, as the meaningless text in the initial training set and the BERT training set.
  • Steps 501 to 503 generate new nonsense texts directly by cutting and randomly combining meaningful texts, without manual judgment and labeling, expanding the data set without increasing the consumption of human resources.
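Step 502's adjustment operations can be sketched as follows. These word-level versions are illustrative simplifications; the patent's examples scramble the character order of Chinese text.

```python
import random

def shuffle_text(text: str, rng: random.Random) -> str:
    """Scramble the word order of a meaningful text (shuffling)."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def cut_text(text: str, rng: random.Random) -> str:
    """Keep only a random-length prefix of the text (cutting)."""
    words = text.split()
    k = rng.randint(1, max(1, len(words) - 1))
    return " ".join(words[:k])

def splice_texts(a: str, b: str, rng: random.Random) -> str:
    """Join cut-off parts of two meaningful texts (splicing)."""
    return cut_text(a, rng) + " " + cut_text(b, rng)
```

Because the outputs are derived mechanically from texts already known to be meaningful, they can be labeled meaningless without any manual annotation, which is the point of steps 501 to 503.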
  • Because the output of the BERT model is among the multiple input features of the nonsense text recognition model, if the same data set were used to train both the BERT model and the nonsense text recognition model, the data used for the two would partially overlap, which would lead to overfitting of the trained nonsense text recognition model.
  • In some embodiments, therefore, the intersection of the BERT training set and the initial training set is an empty set; that is, two non-overlapping data sets are used to train the BERT model and the nonsense text recognition model, which avoids the overfitting problem.
  • In practice, a larger data set D containing meaningless and meaningful text can be obtained first and then divided into two data sets D1 and D2, which are used as the BERT training set and the initial training set respectively.
  • The sizes of data sets D1 and D2 can be the same or different, as can the number and ratio of meaningful and meaningless texts in each, which will not be repeated here.
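Obtaining disjoint D1 and D2 from a larger set D can be sketched as a shuffled split; the 50/50 ratio below is just one choice, since the section notes the two sizes may differ.

```python
import random

def split_disjoint(dataset, ratio: float = 0.5, seed: int = 0):
    """Split dataset D into disjoint D1 (BERT training set) and
    D2 (initial training set) with an empty intersection."""
    items = list(dataset)
    random.Random(seed).shuffle(items)  # deterministic shuffle for this sketch
    cut = int(len(items) * ratio)
    return items[:cut], items[cut:]     # D1, D2: no shared samples
```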
  • In some embodiments, the voice interaction method provided by the present application also involves a response judgment model.
  • The response judgment model is a FastText model, trained on a response data set to obtain a trained FastText model, where the response data set includes texts that need a response and texts that do not, i.e., each text is labeled as to whether it needs a response.
  • A traditional method cannot avoid the answering errors caused by responding to wrong text. For example, if the text obtained by ASR processing is the noise-corrupted "It's okay, I'm counting me at the check-in machine", a traditional method would still answer it.
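As a rough, dependency-free stand-in for the FastText response judgment model, the sketch below trains a bag-of-words perceptron on an invented toy response data set; a real system would train an actual FastText classifier on labeled data.

```python
from collections import defaultdict

def train_perceptron(samples, epochs: int = 10):
    """samples: list of (text, label) pairs, label 1 = needs a response."""
    w = defaultdict(float)  # one weight per word (bag-of-words)
    for _ in range(epochs):
        for text, label in samples:
            score = sum(w[t] for t in text.split())
            pred = 1 if score > 0 else 0
            if pred != label:  # perceptron update on mistakes only
                for t in text.split():
                    w[t] += 1.0 if label == 1 else -1.0
    return w

def needs_response(w, text: str) -> bool:
    return sum(w[t] for t in text.split()) > 0

# Invented toy response data set: 1 = needs a response, 0 = does not.
DATA = [
    ("where is the check-in counter", 1),
    ("please play a song", 1),
    ("mom I am going too", 0),
    ("goodbye see you later", 0),
]
W = train_perceptron(DATA)
```

FastText similarly averages word (and subword n-gram) embeddings into a linear classifier, which is why a bag-of-words linear model is a reasonable miniature of it.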
  • The step division of the above methods is only for clarity of description; during implementation, steps may be combined into one step, or a step may be split into multiple steps, and as long as the same logical relationship is included, they are within the scope of protection of this patent. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or process without changing its core design is also within the scope of protection of this patent.
  • The embodiment of the present application also provides a voice interaction system, as shown in Fig. 6, comprising:
  • the acquisition module 601, configured to acquire text information obtained after the voice signal is processed by automatic speech recognition (ASR), where the voice signal is a sound signal obtained from the environment;
  • the feature extraction module 602, configured to perform feature extraction on the text information to obtain a feature vector of the text information;
  • the meaning judgment module 603, configured to input the feature vector into the trained nonsense text recognition model and judge whether the text information is nonsense text according to the model's output, where nonsense text is text that does not conform to conventional expressions;
  • the response module 604, configured to respond to the text information if it is not nonsense text, after the trained response judgment model detects that the text information needs a response.
  • It is worth mentioning that this embodiment is a system embodiment corresponding to the above method embodiment and can be implemented in cooperation with it. The relevant technical details mentioned in the method embodiment remain valid here and are not repeated in order to reduce repetition; likewise, the relevant technical details mentioned in this embodiment can be applied in the method embodiment.
  • The modules involved in this embodiment are logical modules; a logical unit may be a physical unit, part of a physical unit, or a combination of multiple physical units.
  • Units not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
  • the embodiment of the present application also provides an electronic device, as shown in FIG. 7 , including:
  • At least one processor 701 and,
  • a memory 702 connected in communication with the at least one processor 701; wherein,
  • the memory 702 stores instructions executable by the at least one processor 701, and the instructions are executed by the at least one processor 701 so that the at least one processor 701 can perform the voice interaction method provided by the embodiments of the present application.
  • the memory and the processor are connected by a bus
  • the bus may include any number of interconnected buses and bridges, and the bus links one or more processors and various circuits of the memory together.
  • the bus may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, etc., which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • Data processed by the processor is transmitted over the wireless medium through the antenna; the antenna also receives data and passes it to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory may be used to store data that the processor uses when performing operations.
  • Another aspect of the embodiment of the present application provides a computer-readable storage medium storing a computer program.
  • the above method embodiments are implemented when the computer program is executed by the processor.
  • Another aspect of the embodiment of the present application provides a computer program.
  • the above method embodiments are implemented when the computer program is executed by the processor.
  • The program is stored in a storage medium and includes several instructions to make a device (which may be a single-chip microcomputer, a chip, etc.) or a processor execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage medium includes various media that can store program code: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and so on.


Abstract

The embodiments of this application relate to the field of voice interaction technology and provide a voice interaction method, system, electronic device, and storage medium. The voice interaction method includes: obtaining text information produced by automatic speech recognition (ASR) processing of a speech signal, where the speech signal is a sound signal captured from the environment; performing feature extraction on the text information to obtain feature vectors of the text information; feeding the feature vectors into a trained nonsense-text recognition model and judging, from the model's output, whether the text information is nonsense text, where nonsense text is text that does not conform to conventional modes of expression; and, if the text information is not nonsense text, responding to the text information after a trained response-judgment model detects that a response is required.

Description

Voice interaction method, system, electronic device, and storage medium
Cross-reference to related applications
This application is based on, and claims priority from, Chinese patent application No. CN202110707954.6 filed on June 24, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The embodiments of this application relate to the field of voice interaction technology, and in particular to a voice interaction method, system, electronic device, and storage medium.
Background
Voice interaction typically works as follows: an electronic device such as a robot captures a speech signal from the environment, converts it into text information via automatic speech recognition (ASR), applies natural language understanding (NLU) to extract the intent contained in the text and determine the corresponding response content, converts the response from text to speech via text-to-speech (TTS), and finally outputs the speech, completing the interaction. Because both NLU and TTS operate on the text produced by ASR, the quality of the ASR result directly affects the quality of the voice-interaction response. In real application scenarios, however, the environment in which voice interaction takes place is usually noisy, and interference such as noise is unavoidable — especially in public settings such as airports and hospitals, where the ambient sound is louder and interference greater. In a noisy environment the captured speech signal contains a great deal of background noise, for example nearby conversations and ambient sounds, so when ASR converts the speech signal into text it converts the background noise into text as well, degrading the ASR result and causing the electronic device to answer incorrectly or answer incessantly. One possible remedy is to keep raising the accuracy and precision of ASR processing to reduce noise input.
However, judging by the results of existing high-precision, high-accuracy ASR models, the problem of devices answering incorrectly and incessantly remains unsolved, and pushing ASR precision and accuracy further is difficult. A new voice interaction method is therefore urgently needed to avoid wrong and incessant answering by electronic devices and to improve response quality in noisy environments.
Summary of the invention
An embodiment of this application provides a voice interaction method comprising the following steps: obtaining text information produced by automatic speech recognition (ASR) processing of a speech signal, wherein the speech signal is a sound signal captured from the environment; performing feature extraction on the text information to obtain feature vectors of the text information; feeding the feature vectors into a trained nonsense-text recognition model and judging, from the output of the nonsense-text recognition model, whether the text information is nonsense text, wherein nonsense text is text that does not conform to conventional modes of expression; and, if the text information is not nonsense text, responding to the text information after a trained response-judgment model detects that a response to the text information is required.
An embodiment of this application further provides a voice interaction system, comprising: an acquisition module for obtaining text information produced by automatic speech recognition (ASR) processing of a speech signal, wherein the speech signal is a sound signal captured from the environment; a feature extraction module for performing feature extraction on the text information to obtain feature vectors of the text information; a meaningfulness judgment module for feeding the feature vectors into a trained nonsense-text recognition model and judging, from the model's output, whether the text information is nonsense text, wherein nonsense text is text that does not conform to conventional modes of expression; and a response module for, if the text information is not nonsense text, responding to the text information after a trained response-judgment model detects that a response is required.
An embodiment of this application further provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the voice interaction method described above.
An embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice interaction method described above.
An embodiment of this application further provides a computer program which, when executed by a processor, implements the voice interaction method described above.
Brief description of the drawings
One or more embodiments are illustrated by the figures in the corresponding drawings. These illustrations do not limit the embodiments; elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated the figures are not drawn to scale.
Fig. 1 is a flowchart of the voice interaction method in an embodiment of this application;
Fig. 2 is a flowchart of the voice interaction method including a step of declining to respond, in another embodiment of this application;
Fig. 3 is a flowchart of the voice interaction method including a step of constructing an initial training set, in another embodiment of this application;
Fig. 4 is a flowchart of the voice interaction method including a step of constructing a BERT training set, in another embodiment of this application;
Fig. 5 is a flowchart of the step of obtaining nonsense text when constructing the BERT dataset and the initial dataset in the voice interaction method of an embodiment of this application;
Fig. 6 is a schematic structural diagram of the voice interaction system in another embodiment of this application;
Fig. 7 is a schematic structural diagram of the electronic device in another embodiment of this application.
Detailed description
The purpose of the embodiments of this application is to provide a voice interaction method, system, electronic device, and storage medium that avoid wrong and incessant answering by electronic devices — without requiring any increase in the precision and accuracy of ASR processing — and thereby improve response quality in noisy environments.
As noted in the background, related-art voice interaction is carried out through automatic speech recognition (ASR), natural language understanding (NLU), and text-to-speech (TTS) processing, and whether the ASR module's recognition result is accurate directly affects the response quality during interaction. The usual way to improve response quality is to raise the precision of ASR processing, but existing ASR models are already fairly precise and still cannot stop devices from answering incorrectly and incessantly, and raising ASR precision further is difficult. A new voice interaction method is therefore urgently needed to avoid wrong and incessant answering and improve response quality in noisy environments.
To avoid wrong and incessant answering by electronic devices and improve response quality in noisy environments, an embodiment of this application provides a voice interaction method comprising: obtaining text information produced by automatic speech recognition (ASR) processing of a speech signal, wherein the speech signal is a sound signal captured from the environment; performing feature extraction on the text information to obtain its feature vectors; feeding the feature vectors into a trained nonsense-text recognition model and judging from its output whether the text information is nonsense text, wherein nonsense text is text that does not conform to conventional modes of expression; and, if the text information is not nonsense text, responding to it after a trained response-judgment model detects that a response is required.
In the voice interaction method of this embodiment, after the text information produced by ASR processing of the environmental speech signal is obtained, the system does not go straight to NLU intent understanding and response. It first performs feature extraction so that the resulting feature vectors can serve as input to the trained nonsense-text recognition model, judges from the model's output whether the text information is nonsense text, and only when the text is judged meaningful does it go on to decide whether a response is needed, answering only when one is. By vetting the ASR-derived text before answering — responding only when the text is both meaningful and in need of a response — the method guarantees that whatever is answered is meaningful text that requires an answer. This rules out wrong responses caused by noise-derived text mixed into the transcription, and rules out incessant answering caused by noise being captured as speech. That is, it avoids answering incorrectly or answering noise, and so improves the response quality of voice interaction in noisy environments.
To make the purpose, technical solutions, and advantages of the embodiments clearer, the implementations of this application are described in detail below with reference to the drawings. Those of ordinary skill in the art will appreciate that many technical details are set out in the implementations to aid the reader's understanding; the claimed technical solutions can nevertheless be realized without these details, and with various changes and modifications based on the implementations below.
Implementation details of the method of this embodiment are described below with reference to Figs. 1-5. The following details are provided only to aid understanding and are not required to practice the solution.
Referring to Fig. 1, in some embodiments the voice interaction method is applied to an electronic device capable of voice interaction, such as a robot or a tablet, and specifically includes:
Step 101: obtain the text information produced by automatic speech recognition (ASR) processing of a speech signal.
Specifically, a speech signal is captured from the environment and then ASR-processed to convert it into text information.
It should be noted that in this embodiment the speech signal is a sound signal captured from the environment; it contains not only the user's voice command but possibly also other speech from the user's surroundings. For example, a user who wants a song played issues the voice command "play song A" near the robot while other users nearby are chatting, their conversation including "long time no see"; the robot may then capture from the environment a command interleaved with the other users' chatter, such as "play long song time no see A".
Step 102: perform feature extraction on the text information to obtain its feature vectors.
In this embodiment, the feature vectors of the text information are extracted using a set of language models. For example, in some embodiments these comprise the LSTM model and the Unigram model among language models, together with the BERT language-representation model; step 102 then amounts to: using the LSTM, Unigram, and BERT models within the natural language understanding (NLU) model to extract features along several dimensions, obtaining feature vectors of the text information in multiple dimensions.
It is worth noting that feature extraction uses several models — LSTM, Unigram, and BERT — yielding feature vectors in multiple dimensions rather than a single perplexity value or classification result, so that the judgment of whether the text is meaningful can draw on more information, improving the model's precision and recall.
In one example, the feature vectors of the text information in multiple dimensions can be expressed as follows:
bert_prob=BERT(S)
[The nine feature expressions here appear only as embedded images in the source (PCTCN2021140759-appb-000001 through -000009) and cannot be reproduced; from the symbol definitions below they are probability- and perplexity-style scores derived from BERT(S), Pm(S), Pu(S), Pm(w), and Pu(w).]
where S denotes the text, |S| the text length, and N the number of words; ⟨x⟩ denotes rounding up to an integer; BERT(S) denotes the probability value produced by the BERT model; Pm(S) = P(w1)P(w2|w1)…P(wn|w1, w2, …, wn-1) denotes the probability of text S under the LSTM language model, wn being a word of S; Pu(S) = P(w1)P(w2)…P(wn) denotes the probability of text S under the Unigram language model, wn being a word of S; Pm(w) = P(wi|w1, w2, …, wi-1) denotes the probability of the current word w under the LSTM language model, with w = wi; and Pu(w) = P(wi) denotes the probability of the current word w under the Unigram language model, with w = wi.
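As a concrete illustration of the Unigram branch of the feature extraction above, the sketch below computes Pu(S) and a perplexity-style score for a tokenized sentence in pure Python. This is a minimal sketch under assumed names (`unigram_features`, `unigram_probs`, the `default` fallback probability); the BERT and LSTM features of the real system would come from trained models and are not reproduced here.

```python
import math

def unigram_features(sentence, unigram_probs, default=1e-6):
    """Unigram branch of the multi-dimensional feature vector.

    `sentence` is the word list of S; `unigram_probs` maps each word
    to P(w), with unseen words falling back to `default`.  Returns
    P_u(S) = P(w1)P(w2)...P(wn) and the length-normalized perplexity
    PPL = P_u(S) ** (-1/|S|).
    """
    probs = [unigram_probs.get(w, default) for w in sentence]
    p_s = math.prod(probs)                # P_u(S)
    ppl = p_s ** (-1.0 / len(sentence))   # perplexity of S
    return {"unigram_prob": p_s, "unigram_ppl": ppl}

feats = unigram_features(
    ["please", "turn", "on", "the", "light"],
    {"please": 0.01, "turn": 0.02, "on": 0.05, "the": 0.06, "light": 0.004},
)
```

In the described pipeline such scalar scores — together with the BERT and LSTM outputs — would be concatenated into the feature vector fed to the nonsense-text recognition model.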
In some embodiments, the nonsense-text recognition model is an extreme gradient boosting (XGBoost) model.
It is worth noting that the usual NLP approaches to judging whether text is meaningful are the perplexity (PPL) of a language model or a deep-learning classification model, and both have shortcomings. A PPL value indicates only a trend — the larger the value, the less probable the text — but there is no fixed threshold above which all text is meaningless, and deep-learning classifiers have low recall on nonsense text. Using the ensemble model XGBoost to judge whether text is meaningful lets the decision draw on a variety of features, such as the classifier's output and the language models' PPL, improving the model's precision and recall.
Step 103: feed the feature vectors into the trained nonsense-text recognition model and judge from its output whether the text information is nonsense text.
In this embodiment, nonsense text is text that does not conform to conventional modes of expression; conversely, meaningful text is text phrased in a commonly used way.
It should be noted that, in general, if the text information is a voice command interleaved with interference such as environmental noise, its content is broken up by the interference and cannot be matched to any common mode of expression. In other words, nonsense text differs from the command actually issued, so any response to it would necessarily differ from the intent the command carried — that is, the response would be wrong. Judging whether text is nonsense is thus, in essence, judging whether the text can be answered correctly; if it cannot, there is no need to answer it at all.
Step 104: if the text information is not nonsense text, respond to it after the trained response-judgment model detects that a response is required.
In this embodiment, even meaningful text — text subject to little or no interference — does not necessarily require a response. When the text information is something like "I'm reading" or "the weather is fine", the user may not want the electronic device to respond; likewise, ASR may pick up nearby conversations and cause the robot to answer incessantly — utterances such as "Mom, I want to go too" or "Bye, honey" are not commands to the robot and need no response. After the text information is determined, therefore, it is still necessary to judge whether it should be answered.
Moreover, as explained for step 103, no response is needed when the text is judged to be nonsense. Accordingly, in some embodiments, referring to Fig. 2, step 103 is: feed the feature vectors into the trained nonsense-text recognition model and judge from its output whether the text information is nonsense text; if yes, go to step 105; if no, go to step 106.
Step 103 is then followed by:
Step 105: decline to respond to the text information.
Step 106: use the trained response-judgment model to judge whether the text information requires a response; if yes, go to step 107; if no, go to step 105.
Step 107: respond to the text information.
Steps 105-107 amount to the following: for a given piece of text information, first call the nonsense-text recognition model to judge whether it is meaningful; if not, stop all further processing and do not respond; if so, call the response-judgment model to decide whether to respond to the text. For example, the ASR output "excuse me, where do I check in for my flight" passes through the nonsense-text recognition model and is judged meaningful; the response-judgment model is then called, judges that a response is needed, and the robot gives an answer. The text "never mind I am here toward check-in counting me" is judged nonsense by the model, the response-judgment model is never called, and the robot does not respond.
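The decision flow of steps 103-107 can be sketched as a single function. The function and parameter names here (`handle_asr_text`, `is_nonsense`, `needs_response`, `answer`) and the toy string rules are assumptions for illustration; in the described system the two predicates would be the trained XGBoost and FastText models operating on the extracted feature vectors.

```python
def handle_asr_text(text, extract_features, is_nonsense, needs_response, answer):
    """Steps 103-107: vet the ASR text before answering.

    Returns the answer string, or None when the device stays silent
    (step 105) — either because the text is nonsense or because no
    response is required.
    """
    feats = extract_features(text)   # step 102: multi-model features
    if is_nonsense(feats):           # step 103 -> step 105
        return None
    if not needs_response(feats):    # step 106 -> step 105
        return None
    return answer(text)              # step 107: respond

# Stub predicates standing in for the trained XGBoost / FastText models.
reply = handle_asr_text(
    "where do I check in for my flight",
    extract_features=lambda t: t,
    is_nonsense=lambda f: "counting me" in f,   # toy rule
    needs_response=lambda f: "check in" in f,   # toy rule
    answer=lambda t: "Counter K, to your left.",
)
```

The key property is that the nonsense check short-circuits everything downstream, so noise-derived text never reaches the response stage.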
It should be noted that steps 106 and 107 together correspond to step 104; only one specific implementation is given here, and other steps could likewise be split or merged, which is not enumerated further.
The embodiments above describe how the models are used for voice interaction; the following embodiments describe how the models are trained.
In some embodiments, step 103 is preceded by step 108: training the nonsense-text recognition model. Referring to Fig. 3, step 108 specifically includes the following steps:
Step 1081: construct an initial training set containing both nonsense text and meaningful text.
In this embodiment, the initial training set may be an existing open-source dataset or any corpus containing nonsense and meaningful text. This embodiment limits neither the number of texts in the initial training set nor the size of the dataset, nor the ratio of meaningful to nonsense text in it.
Note that the texts in the initial dataset are not raw text but text labeled as meaningful or not — for example, the text "please turn on the light" is labeled meaningful, while the text "answer me the warmth of the score" is labeled nonsense.
Step 1082: perform feature extraction on the nonsense text and meaningful text contained in the initial training set, and use the resulting feature vectors as the recognition training set.
Step 1083: train the nonsense-text recognition model on the recognition training set to obtain the trained nonsense-text recognition model.
It should be noted that this embodiment does not limit the manner of training, such as the training objective; how to train can be decided according to the actual situation.
It is worth noting that training in this embodiment uses the features extracted from the nonsense and meaningful texts rather than the texts themselves, so the training samples better reflect the texts' characteristics along the meaningful-versus-nonsense dimension without over-weighting their other properties. This makes the model's meaningfulness judgments more accurate and improves the recognition performance of the nonsense-text recognition model.
It should also be noted that the feature extraction in step 1082 can be realized by extracting features along several dimensions with the LSTM, Unigram, and BERT models within the NLU model. In some embodiments, step 108 is preceded by step 109: training the NLU model, where the NLU model comprises the LSTM, Unigram, and BERT models. Referring to Fig. 4, step 109 specifically includes the following steps:
Step 1091: construct a BERT training set containing both nonsense text and meaningful text.
In this embodiment, the BERT training set has much the same meaning as the initial training set of step 1081, so the details are not repeated here.
Step 1092: train the BERT model on the BERT training set.
In practice this amounts to optimizing the BERT model using the BERT training set.
It is worth noting that training the BERT model on a set composed of meaningful and nonsense text makes the model sensitive to whether text is meaningful, sharpening its output along the meaningfulness dimension.
Step 1093: train the Unigram and LSTM models on open-source datasets.
In this embodiment, the open-source datasets may be, for example, Wikipedia, fiction, or news; these are merely concrete examples, and other kinds of open-source datasets can be used as well.
It is worth noting that, because open-source datasets are used, a large volume of training data is available, enabling better unsupervised training of the language models and avoiding training that falls short of precision and accuracy requirements.
When constructing datasets that contain nonsense and meaningful text — such as the initial training set of step 1081 and the BERT training set of step 1091 — a large amount of nonsense text is needed to improve training, but a large amount of nonsense text ordinarily means a large amount of manual labeling. Therefore, in some embodiments, referring to Fig. 5, nonsense text is obtained as follows:
Step 501: obtain, from a corpus, text that does not conform to conventional modes of expression and text that does.
Step 502: randomly apply adjustment operations to the text that conforms to conventional modes of expression, the adjustment operations comprising one or a combination of: shuffling, cutting, and splicing.
In this embodiment, the text that conforms to conventional expression — that is, meaningful text — is adjusted, for example by splicing a cut fragment onto another text or onto a fragment cut from another text, scrambling the character order of a text, or splicing two texts together. Through these adjustments the meaningful text comes out malformed, simulating the transcription of a voice command that suffered interference in a real scene.
In one example, the normal meaningful sentence "you look really pretty" is randomly scrambled into "look you pretty get really"; the normal meaningful sentence "you look really pretty" is randomly cut to "you look really"; and the two normal meaningful sentences "you look really pretty" and "I'd like to ask how to check in" are cut and spliced into "you look really check in"; and so on.
Step 503: use the adjusted conforming text and the non-conforming text as the nonsense text of the initial training set and the BERT training set.
It is worth noting that ambient noise produces nonsense text in many and varied forms, and relying purely on manual searching, construction, and labeling of training data would cost a great deal of labor. Steps 501 and 502 generate new nonsense text directly by cutting and randomly recombining meaningful text, requiring no manual judgment or labeling, thereby enlarging the dataset without increasing the expenditure of human resources.
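The shuffle/cut/splice adjustments of step 502 can be generated mechanically. The helper below is a minimal sketch: the function and parameter names are assumptions, and the fixed half-length cut point is an arbitrary choice for illustration (the described method cuts and recombines at random positions).

```python
import random

def make_nonsense(a, b, rng=None):
    """Build three pseudo-nonsense samples from two meaningful texts:
    a shuffled copy of `a`, a cut of `a`, and a splice of the first
    half of `a` with the second half of `b` (step 502)."""
    rng = rng or random.Random(0)
    chars = list(a)
    rng.shuffle(chars)                            # shuffle
    cut = a[: len(a) // 2]                        # cut
    splice = a[: len(a) // 2] + b[len(b) // 2:]   # splice
    return ["".join(chars), cut, splice]

samples = make_nonsense("you look really pretty",
                        "how do I check in for my flight")
```

Each generated sample inherits the "nonsense" label automatically, which is what removes the manual annotation cost mentioned above.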
Furthermore, considering that some BERT models effectively encode and then decode their input, their output is rather similar to their input, and the multiple features fed to the nonsense-text recognition model include the BERT model's output. Training the BERT model and the nonsense-text recognition model on the same dataset would therefore cause the data used to train BERT to overlap in content with the data used to train the recognition model, making the trained nonsense-text recognition model overfit. For this reason the intersection of the BERT training set and the initial training set is the empty set — that is, BERT and the nonsense-text recognition model are trained on two non-overlapping datasets — which avoids the overfitting problem in the nonsense-text recognition model.
In some embodiments, to make the intersection of the BERT training set and the initial training set empty, a larger dataset D containing nonsense and meaningful text can first be obtained and then split into two datasets D1 and D2, which serve as the BERT training set and the initial training set respectively. D1 and D2 may be the same size or different sizes, and the counts and ratios of meaningful to nonsense text in them may likewise be the same or different; these details are not enumerated here.
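Splitting the larger dataset D into two disjoint sets D1 and D2 — so that the BERT training set and the initial training set have an empty intersection — can be sketched as follows. The names and the 0.5 ratio are illustrative; as noted above, the two halves need not be the same size.

```python
import random

def split_disjoint(dataset, ratio=0.5, seed=0):
    """Partition dataset D into D1 (BERT training set) and D2 (initial
    training set) with an empty intersection, de-duplicating first so
    that no sample can land on both sides."""
    items = list(dict.fromkeys(dataset))   # drop duplicates, keep order
    random.Random(seed).shuffle(items)
    k = int(len(items) * ratio)
    return items[:k], items[k:]

d1, d2 = split_disjoint(["s1", "s2", "s3", "s4", "s1"])
```

The de-duplication step matters: without it, a sentence appearing twice in D could end up in both D1 and D2 and reintroduce the overlap the split is meant to prevent.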
In addition, the voice interaction method provided by this application also involves the response-judgment model. In some embodiments the response-judgment model is a FastText model. Before the trained FastText model is used in step 104, the FastText model is trained on a pre-constructed response dataset to obtain the trained FastText model, where the response dataset comprises texts that require a response and texts that do not — that is, each text is labeled with whether a response is needed.
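FastText's supervised mode expects one example per line prefixed with a label token, so the response dataset can be serialized as below. The label names `respond`/`ignore` are assumptions for illustration; actual training would then pass such a file to FastText's supervised-training entry point rather than the pure-Python serializer shown here.

```python
def to_fasttext_lines(samples):
    """Serialize (text, needs_response) pairs in FastText's supervised
    training format: '__label__<tag> <text>', one example per line."""
    return [
        f"__label__{'respond' if needs else 'ignore'} {text}"
        for text, needs in samples
    ]

lines = to_fasttext_lines([
    ("where do I check in for my flight", True),
    ("mom I want to go too", False),
])
```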
To better illustrate the effect of this application, experimental results of the voice interaction method provided here and of the traditional method of raising ASR precision are compared below:
Text produced by ASR processing                       Traditional method    This application
Excuse me, where does China Southern check in         Responds              Responds
Never mind I am here toward check-in counting me      Responds              No response
Mom, I want to go too                                 Responds              No response
Just ask me                                           Responds              No response
As the table shows, when interference is too great the traditional method cannot avoid the wrong answers that come from responding to erroneous text: for instance, when the ASR output is "never mind I am here toward check-in counting me", the traditional method still responds.
The step divisions of the various methods above are only for clarity of description; in implementation, steps may be merged into one, or some steps may be split into several, and as long as the same logical relationship is preserved they all fall within the protection scope of this patent. Adding inconsequential modifications to an algorithm or flow, or introducing inconsequential designs, without changing the core design of the algorithm and flow likewise falls within the protection scope of the patent.
An embodiment of this application also provides a voice interaction system, as shown in Fig. 6, comprising:
an acquisition module 601 for obtaining text information produced by automatic speech recognition (ASR) processing of a speech signal, wherein the speech signal is a sound signal captured from the environment;
a feature extraction module 602 for performing feature extraction on the text information to obtain its feature vectors;
a meaningfulness judgment module 603 for feeding the feature vectors into the trained nonsense-text recognition model and judging from its output whether the text information is nonsense text, wherein nonsense text is text that does not conform to conventional modes of expression; and
a response module 604 for, if the text information is not nonsense text, responding to it after the trained response-judgment model detects that a response is required.
It is readily apparent that this embodiment is the system embodiment corresponding to the method embodiment above, and the two can be implemented in cooperation with each other. The relevant technical details mentioned in the method embodiment remain valid in this embodiment and, to reduce repetition, are not restated here; correspondingly, the relevant technical details mentioned in this embodiment can also be applied in the method embodiment.
It is worth noting that the modules involved in this embodiment are logical modules; in practical application, a logical unit may be one physical unit, part of one physical unit, or a combination of several physical units. Moreover, to highlight the innovative part of this application, units less closely related to solving the technical problem posed here have not been introduced in this embodiment, which does not mean that no other units exist.
An embodiment of this application also provides an electronic device, as shown in Fig. 7, comprising:
at least one processor 701; and
a memory 702 communicatively connected to the at least one processor 701, wherein
the memory 702 stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor 701 to enable the at least one processor 701 to perform the voice interaction method provided by the embodiments of this application.
The memory and the processor are connected by a bus. The bus may comprise any number of interconnected buses and bridges and links the one or more processors and the various circuits of the memory together. The bus may also link together various other circuits such as peripherals, voltage regulators, and power-management circuits; these are well known in the art and are not described further here. A bus interface provides the interface between the bus and the transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing the means for communicating with various other apparatus over a transmission medium. Data processed by the processor is transmitted over the wireless medium via the antenna; furthermore, the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and for general processing, and can also provide various functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions. The memory may be used to store data the processor uses when performing operations.
Another aspect of the embodiments of this application provides a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the method embodiments above.
Another aspect of the embodiments of this application provides a computer program. The computer program, when executed by a processor, implements the method embodiments above.
Those skilled in the art will understand that all or part of the steps of the methods in the embodiments above can be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc.
Those of ordinary skill in the art will understand that the implementations above are specific embodiments for realizing this application, and that in practical application various changes in form and detail may be made to them without departing from the spirit and scope of this application.

Claims (11)

  1. A voice interaction method, comprising:
    obtaining text information produced by automatic speech recognition (ASR) processing of a speech signal, wherein the speech signal is a sound signal captured from the environment;
    performing feature extraction on the text information to obtain feature vectors of the text information;
    feeding the feature vectors into a trained nonsense-text recognition model and judging, from the output of the nonsense-text recognition model, whether the text information is nonsense text, wherein the nonsense text is text that does not conform to conventional modes of expression;
    if the text information is not the nonsense text, responding to the text information after a trained response-judgment model detects that a response to the text information is required.
  2. The voice interaction method of claim 1, wherein before feeding the feature vectors into the trained nonsense-text recognition model, the method further comprises:
    constructing an initial training set containing both the nonsense text and meaningful text;
    performing feature extraction on the nonsense text and the meaningful text contained in the initial training set, and using the resulting feature vectors as a recognition training set;
    training the nonsense-text recognition model on the recognition training set to obtain the trained nonsense-text recognition model.
  3. The voice interaction method of claim 1 or 2, wherein performing feature extraction comprises:
    extracting features along multiple dimensions using the LSTM model, the Unigram model, and the BERT model within a natural language understanding (NLU) model;
    and before performing feature extraction, the method further comprises:
    constructing a BERT training set containing both the nonsense text and meaningful text;
    training the BERT model on the BERT training set;
    training the Unigram model and the LSTM model on open-source datasets.
  4. The voice interaction method of claim 3, wherein the nonsense text in the initial training set and the BERT training set is obtained by:
    obtaining, from a corpus, text that does not conform to conventional modes of expression and text that does conform to conventional modes of expression;
    randomly applying adjustment operations to the text that conforms to conventional modes of expression, the adjustment operations comprising one or a combination of: shuffling, cutting, and splicing;
    using the adjusted text that conforms to conventional modes of expression and the text that does not conform to conventional modes of expression as the nonsense text of the initial training set and the BERT training set.
  5. The voice interaction method of claim 3 or 4, wherein the intersection of the BERT training set and the initial training set is the empty set.
  6. The voice interaction method of any one of claims 1-5, wherein the nonsense-text recognition model is an extreme gradient boosting (XGBoost) model.
  7. The voice interaction method of any one of claims 1-6, wherein the response-judgment model is a FastText model, and before responding to the text information after the trained response-judgment model detects that a response to the text information is required, the method further comprises:
    training the FastText model on a pre-constructed response dataset to obtain the trained FastText model, wherein the response dataset comprises texts that require a response and texts that do not require a response.
  8. A voice interaction system, comprising:
    an acquisition module for obtaining text information produced by automatic speech recognition (ASR) processing of a speech signal, wherein the speech signal is a sound signal captured from the environment;
    a feature extraction module for performing feature extraction on the text information to obtain feature vectors of the text information;
    a meaningfulness judgment module for feeding the feature vectors into a trained nonsense-text recognition model and judging, from the output of the nonsense-text recognition model, whether the text information is nonsense text, wherein the nonsense text is text that does not conform to conventional modes of expression;
    a response module for, if the text information is not the nonsense text, responding to the text information after a trained response-judgment model detects that a response to the text information is required.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the voice interaction method of any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice interaction method of any one of claims 1-7.
  11. A computer program which, when executed by a processor, implements the voice interaction method of any one of claims 1-7.
PCT/CN2021/140759 2021-06-24 2021-12-23 Voice interaction method, system, electronic device and storage medium WO2022267405A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110707954.6 2021-06-24
CN202110707954.6A CN113362815A (zh) 2021-06-24 2021-06-24 Voice interaction method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2022267405A1 true WO2022267405A1 (zh) 2022-12-29

Family

ID=77536301

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140759 WO2022267405A1 (zh) 2021-06-24 2021-12-23 Voice interaction method, system, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113362815A (zh)
WO (1) WO2022267405A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362815A (zh) * 2021-06-24 2021-09-07 达闼机器人有限公司 语音交互方法、系统、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150348572A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Detecting a user's voice activity using dynamic probabilistic models of speech features
CN107665708A (zh) * 2016-07-29 2018-02-06 iFLYTEK Co., Ltd. Intelligent voice interaction method and system
US10489393B1 (en) * 2016-03-30 2019-11-26 Amazon Technologies, Inc. Quasi-semantic question answering
CN111816172A (zh) * 2019-04-10 2020-10-23 Alibaba Group Holding Ltd. Voice response method and apparatus
CN112614514A (zh) * 2020-12-15 2021-04-06 iFLYTEK Co., Ltd. Valid speech segment detection method, related device and readable storage medium
CN113362815A (zh) * 2021-06-24 2021-09-07 CloudMinds Robotics Co., Ltd. Voice interaction method, system, electronic device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4162074B2 (ja) * 2001-09-27 2008-10-08 Mitsubishi Electric Corp Interactive information retrieval apparatus
CN110674276A (zh) * 2019-09-23 2020-01-10 WeBank Co., Ltd. Robot self-learning method, robot terminal, apparatus and readable storage medium
CN111554293B (zh) * 2020-03-17 2023-08-22 Shenzhen Aoto Electronics Co., Ltd. Method, apparatus, medium and dialogue robot for filtering noise in speech recognition
CN111966706B (zh) * 2020-08-19 2023-08-22 Bank of China Official-account reply method and apparatus
CN112735465B (zh) * 2020-12-24 2023-02-24 Guangzhou Fangsi Information Technology Co., Ltd. Invalid information determination method and apparatus, computer device and storage medium


Also Published As

Publication number Publication date
CN113362815A (zh) 2021-09-07


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946879

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE