CN114822533B - Voice interaction method, model training method, electronic device and storage medium


Info

Publication number
CN114822533B
CN114822533B
Authority
CN
China
Prior art keywords
user voice
request
voice request
interaction method
voice
Prior art date
Legal status
Active
Application number
CN202210378175.0A
Other languages
Chinese (zh)
Other versions
CN114822533A (en)
Inventor
李万水
陈光毅
翁志伟
孙仿逊
李晨延
赵耀
易晖
李嘉辉
Current Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202210378175.0A
Publication of CN114822533A
Application granted
Publication of CN114822533B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L2015/0631: Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice interaction method, a model training method, an electronic device and a storage medium. The voice interaction method comprises the following steps: acquiring user voice data and performing voice recognition in real time to obtain a user voice request; when the complete user voice request has not yet been received, predicting the user voice request according to the partial request obtained in real time and a recurrent neural network model to obtain a prediction result; processing the prediction result to obtain a first prediction instruction; and, after the complete user voice request is received, completing the voice interaction according to the first prediction instruction if the prediction result is the same as the received complete user voice request. By predicting and completing the user voice request with a recurrent neural network model, the invention reduces the total time the dialogue system needs to process the user voice request; the lightweight model can predict routine sentences accurately and quickly while keeping richness, parameter count, time cost and expense under control.

Description

Voice interaction method, model training method, electronic device and storage medium
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a voice interaction method, a model training method, an electronic device, and a storage medium.
Background
For text completion, related services include input-method suggestion words, search-box input prompts, code completion hints, automatic text generation, and the like. One option is a conventional retrieval algorithm, but this requires storing a large amount of data in advance, searching a vast database, and trading space for time to improve efficiency and accuracy. Another option is the pre-trained text generation models that have become popular in recent years. These offer high accuracy and rich diversity, but their parameter counts are enormous, domain-specific text still requires lengthy training, text generation is slow, the hardware requirements are high, and the time cost is correspondingly large.
In a speech recognition scenario, however, the aim is to predict the user's intention in advance, so that time is gained within the user's normal speaking interval without affecting normal use of the system. The model must therefore generate quickly, with high accuracy and a deliberately small output volume; it cannot be too large, diversity is not a goal, and its storage footprint should be as small as possible to save cost.
Disclosure of Invention
The embodiment of the invention provides a voice interaction method, a model training method, an electronic device and a storage medium.

The embodiment of the invention provides a voice interaction method. The voice interaction method comprises the following steps: acquiring user voice data and performing voice recognition in real time to obtain a user voice request; when the complete user voice request has not been received, predicting the user voice request according to the partial request obtained in real time and a recurrent neural network model to obtain a prediction result; processing the prediction result to obtain a first prediction instruction; and, after the complete user voice request is received, completing the voice interaction according to the first prediction instruction if the prediction result is the same as the received complete user voice request.

Therefore, the voice interaction method of the invention predicts and completes the user voice request based on the recurrent neural network model, which reduces the total time the dialogue system needs to process the user voice request; the lightweight model can predict routine sentences accurately and quickly while keeping richness, parameter count, time cost and expense under control.
Predicting the user voice request according to the partial request obtained in real time and the recurrent neural network model when the complete user voice request has not been received comprises the following steps: providing the user voice request obtained in real time as the current input to the recurrent neural network model to predict a character; when the predicted character is not a preset character, splicing the predicted character onto the current input to form the next input and providing the next input to the model for another prediction; and taking the current input as the prediction result when the predicted character is the preset character.

Thus, an incomplete user voice request can be fed to the recurrent neural network model for prediction to obtain a prediction result.

Splicing the predicted character onto the current input to form the next input and predicting again when the predicted character is not the preset character comprises the following steps: obtaining the confidence of the predicted character and the entropy of the prediction probability distribution; and splicing the predicted character onto the current input to form the next input and providing it to the recurrent neural network for another prediction only when the confidence is greater than a first threshold and the entropy of the prediction probability distribution is smaller than a second threshold.

Thus, the invention imposes a dual constraint: the confidence of the predicted result must exceed the first threshold, and the entropy of the predicted distribution must stay below the second threshold. A predicted character is accepted only when both conditions hold at once, so with high probability only a small number of results pass, which keeps the throughput under control.
The voice interaction method comprises the following steps: determining that the prediction fails when the number of characters input to the recurrent neural network model is greater than the maximum predicted character count.

Therefore, the total duration the dialogue system needs to process the user voice request can be reduced, and the prediction efficiency of the recurrent neural network model is improved.

Predicting the user voice request according to the partial request obtained in real time and the recurrent neural network model when the complete user voice request has not been received comprises the following steps: completing the user voice request obtained in real time based on a prefix tree when the complete request has not been received; and, when the prefix-tree completion yields no completion result, predicting the user voice request according to the partial request obtained in real time and the recurrent neural network model to obtain the prediction result.

Therefore, the personalized model and the global model together ensure that the voice interaction method obtains a completion result in time.
The voice interaction method comprises the following steps: when a completion result is obtained by prefix-tree completion, processing the completion result to obtain a second prediction instruction; and, after the complete user voice request is received, completing the voice interaction according to the second prediction instruction if the completion result is the same as the received complete user voice request.

Thus, an incomplete user voice request can be completed based on the prefix tree to obtain a completion result; the subsequently received complete request is compared with the completion result, and when they are the same, the voice interaction can be completed according to the second prediction instruction corresponding to the completion result.

Predicting the user voice request according to the partial request obtained in real time and the recurrent neural network model when the complete user voice request has not been received comprises the following steps: completing the user voice request obtained in real time based on the prefix tree to obtain a completion result when the complete request has not been received; and predicting on the basis of the completion result according to the recurrent neural network model to obtain the prediction result.

Therefore, the incomplete user voice request can first be completed by the personalized model, and the completion results can then be fed into the recurrent network built from Long Short-Term Memory (LSTM) units to be scored and ranked into a prediction result, so that the two models are organically fused and a more accurate prediction result is obtained.

The voice interaction method comprises the following steps: completing the user voice request obtained in real time based on the prefix tree to obtain a completion result when the complete user voice request has not been received; processing the completion result to obtain a third prediction instruction; and, after the complete user voice request is received, completing the voice interaction according to the corresponding first or third prediction instruction if the prediction result or the completion result is the same as the received complete request.

In this way, the incomplete user voice request is processed by the global model and the personalized model simultaneously to obtain a prediction result and a completion result, which are processed into a first prediction instruction and a third prediction instruction respectively, so that a prediction instruction can be obtained from the incomplete request and the voice interaction completed.
The voice interaction method comprises the following steps: determining a completion condition by data analysis; and completing the user voice request obtained in real time based on the prefix tree to obtain the completion result when the request obtained in real time satisfies the completion condition.

Therefore, on the basis of the prefix tree, the invention can build instruction prediction closely tied to the individual from the received incomplete user voice request and automatically complete utterances strongly associated with that individual, obtaining a completion result and achieving a per-user, personalized effect in which different users' personalized utterances are recognized.

Completing the user voice request obtained in real time based on the prefix tree when the completion condition is satisfied comprises the following steps: determining the voice type of the user voice request according to the request obtained in real time when the completion condition is satisfied; and selecting the prefix tree corresponding to that voice type to complete the request obtained in real time and obtain the completion result.

Therefore, the voice interaction method can identify the personalized voice type to which the user voice request belongs from the request obtained in real time and automatically complete the request using the prefix tree corresponding to that type, again achieving a per-user, personalized effect in which different users' personalized utterances are recognized.
The voice interaction method comprises the following steps: after the received complete user voice request has been processed into a user voice instruction, adding the complete user voice request to the prefix tree.

Thus, the voice interaction method updates the prefix tree by recording every sentence, ensuring that the prefix tree stays current.
The voice interaction method comprises the following steps: acquiring a historical user voice request in a preset time period; and constructing the prefix tree according to the historical user voice request.
Therefore, an initial prefix tree can be constructed according to the historical user voice requests in the preset time period, and a foundation is laid for the follow-up searching of the prefix tree to complement the user voice requests.
The voice interaction method comprises the following steps: setting a forgetting duration for the user voice requests in the prefix tree; and deleting a user voice request from the prefix tree when the duration for which it has gone unused reaches the forgetting duration.

Thus, by setting a forgetting duration, stale historical voice requests are removed from the prefix tree, which reduces storage cost and keeps the prefix tree current.

The voice interaction method comprises the following steps: counting the usage frequency, request length and/or request proportion of the historical user voice requests; and determining the weight of each user voice request in the prefix tree according to the usage frequency, request length and/or request proportion.

Therefore, the weight of historical voice requests that were last used long ago or are used infrequently can be reduced, keeping the prefix tree current.
The invention also provides a model training method for training the recurrent neural network model used by the voice interaction method of any of the above embodiments. The model training method comprises the following steps: acquiring training data; constructing the recurrent neural network model; and training the recurrent neural network with the training data based on a Bayesian method to obtain the trained recurrent neural network model.

Therefore, the invention performs voice interaction based on a recurrent neural network model trained by a Bayesian method, can predict routine sentences accurately and quickly, and can keep sentence richness, parameter count, time cost and expense under control.

Acquiring the training data includes: acquiring a plurality of historical user voice requests; splicing the plurality of historical user voice requests with preset characters to obtain a request character string; and segmenting the request character string according to the maximum processing length to obtain the training data.

In this way, training data can be obtained based on the maximum processing length.

Segmenting the request character string according to the maximum processing length to obtain the training data comprises the following steps: traversing the request character string, and intercepting a span of the maximum processing length starting from the current character to obtain the training input data; and intercepting a span of the maximum processing length starting from the character after the current character to obtain the training result data.

Thus, the training input data and the training result data can be used to train the recurrent neural network model and obtain the trained model.
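As a concrete illustration of this sliding-window segmentation, the following is a minimal Python sketch; the separator character, the example requests and the function name are assumptions for illustration, since the text only specifies "preset characters" and a "maximum processing length".

```python
def build_training_pairs(requests, max_len, sep="#"):
    # Splice the historical user voice requests with the preset separator.
    text = sep.join(requests) + sep
    pairs = []
    # Slide a window of the maximum processing length over the string:
    # the input starts at the current character, the target one later.
    for i in range(len(text) - max_len):
        x = text[i : i + max_len]          # training input data
        y = text[i + 1 : i + 1 + max_len]  # training result data
        pairs.append((x, y))
    return pairs

pairs = build_training_pairs(["turn on the ac", "navigate home"], max_len=6)
```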
The invention further provides an electronic device. The electronic device comprises a processor and a memory; the memory stores a computer program which, when executed by the processor, implements the voice interaction method of any of the above embodiments.

Therefore, the voice interaction method applied to the electronic device predicts and completes the user voice request based on the recurrent neural network model, reducing the total time the dialogue system needs to process the user voice request; the lightweight model can predict routine sentences accurately and quickly while keeping richness, parameter count, time cost and expense under control.

The present invention also provides a non-transitory computer-readable storage medium containing a computer program. When the computer program is executed by one or more processors, the voice interaction method of any of the above embodiments is implemented.

Therefore, the voice interaction method applied to the storage medium likewise predicts and completes the user voice request based on the recurrent neural network model, with the same benefits in total processing time, prediction accuracy and speed, and control of richness, parameter count, time cost and expense.
Additional aspects and advantages of embodiments of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a voice interaction method of the present invention;
FIG. 2 is a schematic diagram of a voice interaction device according to the present invention;
FIG. 3 is a flow chart of the voice interaction method of the present invention;
FIG. 4 is a flow chart of the voice interaction method of the present invention;
FIG. 5 is a flow chart of a voice interaction method of the present invention;
FIG. 6 is a schematic view of a scenario illustrating a prediction process of a prediction result of the voice interaction method of the present invention;
FIG. 7 is a flow chart of a voice interaction method of the present invention;
FIG. 8 is a schematic diagram of a prefix tree structure in the voice interaction method of the present invention;
FIG. 9 is a flow diagram of a processing mechanism of the existing streaming ASR technology framework;
FIG. 10 is a flow diagram of a processing mechanism of an ASR technical framework of the speech interaction method of the present invention;
FIG. 11 is a flow chart of a voice interaction method of the present invention;
FIG. 12 is a flow chart of a voice interaction method of the present invention;
FIG. 13 is a flow chart of a voice interaction method of the present invention;
FIG. 14 is a schematic diagram of a voice interaction device of the present invention;
FIG. 15 is a flow chart of a voice interaction method of the present invention;
FIG. 16 is a flow chart of a voice interaction method of the present invention;
FIG. 17 is a flow chart of a voice interaction method of the present invention;
FIG. 18 is a flow chart of a voice interaction method of the present invention;
FIG. 19 is a schematic diagram of a voice interaction device of the present invention;
FIG. 20 is a flow chart of a voice interaction method of the present invention;
FIG. 21 is a flow chart of a voice interaction method of the present invention;
FIG. 22 is a flow chart of the model training method of the present invention;
FIG. 23 is a schematic structural view of the model training apparatus of the present invention;
FIG. 24 is a flow chart of the model training method of the present invention;
FIG. 25 is a flow chart of the model training method of the present invention;
FIG. 26 is a schematic view of a training process for training a model in the model training method of the present invention;
FIG. 27 is a schematic structural view of an electronic device of the present invention;
FIG. 28 is a schematic structural view of a computer-readable storage medium of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the embodiments of the present invention and are not to be construed as limiting the embodiments of the present invention.
Referring to fig. 1, the present invention provides a voice interaction method. The voice interaction method comprises the following steps:
01: acquiring user voice data to perform voice recognition in real time to acquire a user voice request;
02: when a complete user voice request has not been received, predicting the user voice request according to the partial request obtained in real time and the recurrent neural network model to obtain a prediction result;
03: processing the prediction result to obtain a first prediction instruction;
04: after receiving the complete user voice request, if the prediction result is the same as the received complete user voice request, completing voice interaction according to the first prediction instruction.
Referring to fig. 2, the present invention further provides a voice interaction device 10. The voice interaction device 10 includes: a first acquisition module 11, a prediction module 12, an instruction generation module 13 and a comparison module 14.

Step 01 may be implemented by the first acquisition module 11, step 02 by the prediction module 12, step 03 by the instruction generation module 13, and step 04 by the comparison module 14. That is, the first acquisition module 11 is configured to acquire user voice data and perform voice recognition in real time to obtain a user voice request; the prediction module 12 is configured to predict the user voice request according to the partial request obtained in real time and the recurrent neural network model to obtain a prediction result when the complete user voice request has not been received; the instruction generation module 13 is configured to process the prediction result to obtain a first prediction instruction; and the comparison module 14 is configured to complete the voice interaction according to the first prediction instruction if, after the complete user voice request is received, the prediction result is the same as the received complete request.

Specifically, user voice data is first acquired and recognized in real time to obtain a user voice request. The user voice data is the audio stream directly input by the user; real-time voice recognition is performed on it with automatic speech recognition (ASR) technology to obtain the user voice request. It will be appreciated that automatic speech recognition aims to convert the lexical content of human speech into computer-readable input such as keys, binary codes or character sequences. That is, by applying automatic speech recognition to the audio stream directly input by the user, a computer-readable user voice request can be obtained.

Then, when the complete user voice request has not been received, the user voice request is predicted according to the partial request obtained in real time and the recurrent neural network model to obtain a prediction result. A relatively lightweight Long Short-Term Memory (LSTM) network is selected to build the sequence-to-sequence (Seq2Seq) recurrent neural network model serving as the global model, so that routine sentences can be predicted accurately and quickly while sentence richness, parameter count, time cost and expense remain under control.

Next, the prediction result is processed to obtain a first prediction instruction; after the complete user voice request is received, if the prediction result is the same as the received complete request, the voice interaction is completed according to the first prediction instruction. That is, before automatic speech recognition returns the complete user voice request, the voice interaction method of the present application can already predict from the partial request, based on the recurrent neural network model, to generate a prediction result, and send that result in advance to the downstream modules, namely the natural language understanding (NLU) module, the dialogue management (DM) module and the automated program (BOT) module, to be processed into a first prediction instruction. Once automatic speech recognition returns the complete user voice request, if the prediction result is the same as that request, no further processing of the complete request is needed, which effectively reduces the total time the dialogue system needs to process it.

Therefore, the voice interaction method of the invention predicts and completes the user voice request based on the recurrent neural network model, reducing the total time the dialogue system needs to process the user voice request; the lightweight model can predict routine sentences accurately and quickly while keeping richness, parameter count, time cost and expense under control.

In step 02, the user voice request obtained in real time may be the speech text returned by automatic speech recognition, and the recurrent neural network model is used for prediction when that text contains at least 2 and fewer than 10 characters. For example, step 02 may predict once 2 characters have been obtained in real time, producing the prediction result for a 2-character input; thereafter, each time the speech text returned in real time grows by one character, prediction may run again, producing a prediction result for each input length, and the corresponding processing yields several first prediction instructions. Thus, if any one of these prediction results is the same as the complete user voice request returned by automatic speech recognition, the voice interaction can be completed directly according to the corresponding first prediction instruction; a minimal sketch of this trigger is given below. Of course, when the complete request has not been received, deciding whether to predict from the partial request need not be limited to the manner discussed above and may be varied according to actual needs.
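The following Python sketch illustrates that per-character trigger, assuming a `predict` callable (any model wrapper returning a completed request string or None) and the 2-to-10-character bounds from the text; all names here are illustrative.

```python
# Called each time streaming ASR broadcasts a longer partial text.
def on_asr_partial(partial_text, predict, pending_results):
    # Condition interception: only predict for 2 to 9 recognized characters.
    if 2 <= len(partial_text) < 10:
        result = predict(partial_text)      # run the recurrent-model prediction
        if result is not None:
            pending_results.append(result)  # processed downstream into a
                                            # first prediction instruction
```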
It should be noted that, the automatic speech recognition technique may determine whether a complete user speech request is received after a period of continuous listening by waiting for a timeout mechanism, so as to ensure the integrity of the user utterance as much as possible.
Referring to fig. 3, step 02 includes:
021: providing the user voice request obtained in real time as the current input to the recurrent neural network model for prediction to obtain a predicted character;
022: when the predicted character is not a preset character, splicing the predicted character onto the current input to form the next input, and providing the next input to the recurrent neural network model for another prediction;
023: in the case where the predicted character is a preset character, the current input is taken as a predicted result.
Referring to fig. 2, steps 021, 022 and 023 can be implemented by the prediction module 12. That is, the prediction module 12 is configured to provide the user voice request obtained in real time as the current input to the recurrent neural network model for prediction to obtain a predicted character; to splice the predicted character onto the current input to form the next input and provide it to the model for another prediction when the predicted character is not a preset character; and to take the current input as the prediction result when the predicted character is the preset character.
Specifically, when the complete user voice request has not been received, predicting the user voice request according to the partial request obtained in real time and the recurrent neural network model comprises the following steps:

First, the user voice request obtained in real time is provided as the current input to the recurrent neural network model for prediction to obtain a predicted character. For example, if the request obtained in real time is "打开" ("turn on"), then "打开" is provided as the current input, and, as shown in fig. 3, the predicted character is "空" (the first character of "空调", "air conditioner").

Then, when the predicted character is not the preset character, it is spliced onto the current input to form the next input, which is provided to the model for another prediction; when the predicted character is the preset character, the current input is taken as the prediction result. In the example of fig. 3, the current input "打开" yields the predicted character "空", which is not the preset character "[EOS]", so the two are spliced into the next input "打开空" and provided to the model, which predicts "调". Since "调" is also not "[EOS]", splicing yields the next input "打开空调" ("turn on the air conditioner"); this time the model predicts "[EOS]", the preset character, so the current input "打开空调" is taken as the prediction result.

Thus, an incomplete user voice request can be predicted with the recurrent neural network model to obtain a prediction result.
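A minimal sketch of this character-by-character loop, assuming a `model` callable that maps the current text to one predicted character and the "[EOS]" stop marker described above; the maximum-length check from the later steps is folded in, with its value assumed.

```python
EOS = "[EOS]"
MAX_PRED_CHARS = 10  # maximum predicted character count (example value)

def predict_request(partial_request, model):
    current = partial_request
    while True:
        # Stop if the input grows past the maximum predicted character count.
        if len(current) > MAX_PRED_CHARS:
            return None                # prediction fails; no result is issued
        ch = model(current)            # predict the next character
        if ch == EOS:
            return current             # the current input becomes the result
        current = current + ch         # splice and predict again
```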
Referring to fig. 4, step 022 includes:
0221: obtaining the confidence of the predicted character and the entropy of the prediction probability distribution;
0222: splicing the predicted character onto the current input to form the next input and providing the next input to the recurrent neural network for another prediction when the confidence is greater than a first threshold and the entropy of the prediction probability distribution is smaller than a second threshold.

Referring to fig. 2, steps 0221 and 0222 may be implemented by the prediction module 12. That is, the prediction module 12 is configured to obtain the confidence of the predicted character and the entropy of the prediction probability distribution, and to splice the predicted character onto the current input and provide the result to the recurrent neural network for another prediction when the confidence is greater than the first threshold and the entropy of the prediction probability distribution is smaller than the second threshold.

Specifically, the confidence of a predicted character measures how likely it is that the character is the one the user truly intends to express. The entropy of the prediction probability distribution is computed from the probability distribution output by the long short-term memory network and indicates how close the predicted character is to the character the user truly intends: the smaller the entropy, the closer the prediction.

The method first obtains the confidence of the predicted character and the entropy of the prediction probability distribution; then, when the confidence is greater than the first threshold and the entropy is smaller than the second threshold, i.e. when the predicted character is highly likely to be, and closely matches, the character the user truly intends, the predicted character is spliced onto the current input to form the next input, which is provided to the recurrent neural network for another prediction.
For example, the first threshold may be 60%, 65%, 68%, 72%, 75%, 79%, 80%, 82%, 85%, and 90%, and the second threshold may be 0, 0.1, 0.11, 0.15, 0.2, 0.21, 0.22, 0.23, 0.24, and 0.25.
It will be appreciated that if the entropy of the long short-term memory network's prediction probability distribution is above the second threshold, the current round of prediction is stopped.

Thus, the invention imposes a dual constraint: the confidence of the predicted result must exceed the first threshold, and the entropy of the predicted distribution must stay below the second threshold. A predicted character is accepted only when both conditions hold at once, so with high probability only a small number of results pass, which keeps the throughput under control.
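A minimal sketch of this dual acceptance test follows; the character-to-probability mapping, the function name and the concrete threshold values are assumptions for illustration (80% and 0.25 appear among the example values listed above).

```python
import math

FIRST_THRESHOLD = 0.8    # confidence must exceed this (example value)
SECOND_THRESHOLD = 0.25  # entropy must stay below this (example value)

def accept_prediction(prob_dist):
    """prob_dist maps each candidate next character to its probability."""
    char, confidence = max(prob_dist.items(), key=lambda kv: kv[1])
    entropy = -sum(p * math.log(p) for p in prob_dist.values() if p > 0)
    # Dual constraint: both conditions must hold for the character to be
    # approved and spliced onto the current input.
    if confidence > FIRST_THRESHOLD and entropy < SECOND_THRESHOLD:
        return char
    return None  # stop the current round of prediction

accept_prediction({"空": 0.95, "门": 0.03, "灯": 0.02})  # -> "空"
```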
Referring to fig. 5, the voice interaction method includes:
024: determining that the prediction fails when the number of characters input to the recurrent neural network model is greater than the maximum predicted character count.

Referring to fig. 2, step 024 may be implemented by the prediction module 12. That is, the prediction module 12 is configured to determine that the prediction fails when the number of characters input to the recurrent neural network model is greater than the maximum predicted character count.

It can be understood that the more characters the user has already input, the less time remains to be gained; once the input exceeds a certain number of characters, prediction is no longer worthwhile.

Specifically, the maximum predicted character count of the invention may be 10 characters or some other value; it is a value determined through repeated experiments.

When the maximum predicted character count is 10 characters, no prediction is made once the number of characters input to the recurrent neural network model exceeds 10, i.e. the prediction is determined to have failed.

Prediction proceeds character by character, one character at a time; if "[EOS]" has still not been predicted by the time the maximum predicted character count is reached, no prediction result is issued.

It should be noted that the voice interaction method of the present invention may search for the next character with a beam search algorithm rather than a greedy algorithm or exhaustive enumeration. Beam search widens the search space relative to a greedy algorithm yet falls far short of exhaustively exploring the exponential search space; it is a compromise between the two.

Therefore, the total duration the dialogue system needs to process the user voice request can be reduced, and the prediction efficiency of the recurrent neural network model is improved.
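A minimal beam-search sketch over next-character candidates, assuming a `step` callable that returns (character, probability) pairs with positive probabilities for a given text; the beam width, step limit and all names are illustrative.

```python
import math

def beam_search(prefix, step, beam_width=3, max_steps=10, eos="[EOS]"):
    beams = [(0.0, prefix)]   # (cumulative log-probability, text)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for logp, text in beams:
            for ch, p in step(text):
                if ch == eos:
                    finished.append((logp + math.log(p), text))
                else:
                    candidates.append((logp + math.log(p), text + ch))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring partial sequences.
        beams = sorted(candidates, reverse=True)[:beam_width]
    return max(finished)[1] if finished else None
```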
Referring to fig. 7, step 02 includes:
025: completing the user voice request obtained in real time based on the prefix tree when the complete user voice request has not been received;
026: predicting the user voice request according to the partial request obtained in real time and the recurrent neural network model to obtain a prediction result when the prefix-tree completion yields no completion result.

Referring to fig. 2, steps 025 and 026 may be implemented by the prediction module 12. That is, the prediction module 12 is configured to complete the user voice request obtained in real time based on the prefix tree when the complete request has not been received, and, when the prefix-tree completion yields no completion result, to predict the user voice request according to the partial request obtained in real time and the recurrent neural network model to obtain a prediction result.

It will be appreciated that the global model mainly reflects the general usage of the whole user population, while for a user's individual habits, in order to improve accuracy while guaranteeing speed and throughput, the relevant sentences may be set aside and handled by a personalized model. The invention can therefore fuse the personalized model into the global model to complete the user voice request. The global model is trained on the voice-request corpus of all users and is mainly used to predict globally common, effective instructions. The personalized model is built on the voice-request corpus of the individual user and is used for instruction prediction closely tied to the individual, such as navigation, music, telephone and other high-frequency requests. The global model of the invention is the recurrent neural network model built on the long short-term memory network; the personalized model is built on a prefix tree algorithm.

Referring to Table 1, personalized user voice requests fall into two types. One is the closed type, where real-time recognition of the user voice data yields a fixed request such as "open vehicle state". The other is the open type, where recognition yields a request whose slot is open, such as "navigate to the Yitian Holiday Plaza".
TABLE 1 (reproduced as an image in the original publication)
Specifically, user voice data is first acquired and recognized in real time to obtain a user voice request, so that qualifying personalized requests are identified and different users' personalized voice needs are met. The user voice data is the audio stream directly input by the user, and the user voice request is obtained by performing real-time recognition on it with automatic speech recognition technology, which converts the lexical content of human speech into computer-readable input such as keys, binary codes or character sequences.

Then, when the complete user voice request has not been received, the request obtained in real time is completed based on the prefix tree to obtain a completion result. That is, while the request is being obtained in real time and the complete request has not arrived, once the incomplete request is identified as matching a personalized user voice request in any respect, the corresponding sentence can be looked up in the prefix tree and completed into a completion result, which can then be processed into a predicted voice instruction.

The prefix tree algorithm is simple and easy to implement: through the tree structure, each sentence prefix is stored as a node, so sentence lookup is quick and convenient. A prefix tree may be structured as shown in fig. 8; the nodes in the prefix tree of fig. 8 are "day" and "song".

It can be understood that, because the prefix tree is used to complete the user voice request obtained in real time, and the invention updates the prefix tree in real time by recording the sentences the user inputs every day, the currency of the prefix tree is ensured.
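A minimal prefix-tree (trie) sketch of the storage and lookup just described; weighting, forgetting and voice-type selection are omitted, and all names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_request_end = False

class PrefixTree:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, request):
        # Store the sentence character by character, prefixes as nodes.
        node = self.root
        for ch in request:
            node = node.children.setdefault(ch, TrieNode())
        node.is_request_end = True

    def complete(self, prefix):
        # Walk down to the prefix node, then collect all stored requests.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []            # no completion result from the tree
            node = node.children[ch]
        results = []
        def collect(n, suffix):
            if n.is_request_end:
                results.append(prefix + suffix)
            for ch, child in n.children.items():
                collect(child, suffix + ch)
        collect(node, "")
        return results

tree = PrefixTree()
tree.insert("navigate to the office")
tree.complete("navi")   # -> ["navigate to the office"]
```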
After the complete user voice request is received, if the completion result is the same as the received complete request, the voice interaction is completed according to the predicted voice instruction. That is, once the complete request arrives, the completion result can be compared with it; if the two sentences are identical, or the wording deviates but the expressed meaning is the same, the completion result can be considered the same as the received complete user voice request.

Therefore, when the completion result is the same as the received complete request, the voice interaction can be completed directly according to the predicted voice instruction. By exploiting the user's intention predicted in advance, within the user's normal speaking time, the predicted voice instruction is generated quickly and accurately, and the voice interaction process is accelerated at the level of time.

It will be appreciated that, referring to fig. 9, the processing mechanism of the current streaming ASR framework recognizes the voice information input by the user frame by frame, returns the recognition result in real time, and finally applies a wait-timeout mechanism for continuous listening so as to preserve the integrity of the user's utterance as far as possible. As shown in fig. 9, the available time window of this framework runs from the moment the user voice request is obtained to the end of the ASR timeout wait, time that would be spent in any case, while the maximum achievable saving is the processing time of the natural language understanding (NLU) module, the dialogue management (DM) module and the automated program (BOT) module, i.e. the time the original voice request would spend in NLU+DM+BOT.

Referring to fig. 10, before ASR returns the complete sentence, the voice interaction method of the present invention completes the partial user voice request based on the prefix tree and sends it to the downstream modules in advance for processing, which effectively reduces the total duration the dialogue system needs to process the user command. That is, the method accurately predicts the user's full expression from a low-completeness voice instruction; a prediction whose text, or whose semantics, matches the user's complete voice request is regarded as correct, so the total time needed to process the user instruction is effectively reduced.

In addition, when the prefix-tree completion yields no completion result, the user voice request is predicted according to the partial request obtained in real time and the recurrent neural network model, so that a prediction result can still be obtained in time when the prefix tree produces nothing.

That is, the personalized model can be given priority in completing the incomplete user voice request, with the global model used for prediction whenever the personalized model cannot produce a completion result.

Therefore, the personalized model and the global model together ensure that the voice interaction method obtains a completion result in time.
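A minimal sketch of this personalized-first, global-fallback strategy, reusing the `PrefixTree.complete` and `predict_request` sketches above; the dispatch function itself is an illustrative assumption.

```python
def complete_or_predict(partial_request, tree, model):
    # Personalized model first: look for completions in the prefix tree.
    completions = tree.complete(partial_request)
    if completions:
        return completions[0]
    # Fall back to the global recurrent-model prediction.
    return predict_request(partial_request, model)
```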
Referring to fig. 11, the voice interaction method includes:
027: under the condition that the completion result is obtained by completion based on the prefix tree, the completion result is processed to obtain a second prediction instruction;
028: after the complete user voice request is received, if the completion result is the same as the received complete user voice request, the voice interaction is completed according to the second prediction instruction.
Referring to fig. 2, steps 027 and 028 may be implemented by prediction module 12. That is, the prediction module 12 is configured to process the completion result to obtain a second prediction instruction when the completion result is obtained by performing the completion based on the prefix tree; after the complete user voice request is received, if the completion result is the same as the received complete user voice request, the voice interaction is completed according to the second prediction instruction.
Specifically, the user voice request is sent to the server frame by frame; the streaming ASR on the server recognizes each frame and broadcasts the recognition result in real time. Each time the electronic device receives a broadcast recognition result it can: (1) apply condition interception, limiting the word count to 2-10; and (2) send the incomplete text to the completion module to request completion. If a completion result is output, it is sent to the natural language understanding (NLU) module, the dialogue management (DM) module and the automated program (BOT) module for processing into a second prediction instruction.

Finally, when the streaming ASR module receives the complete user voice request, the completion result is compared with the complete request. If the completion result agrees with the result determined for the complete request after natural language understanding, the completion is considered successful, and the second prediction instruction obtained from the completion result is submitted directly for voice interaction.

If the completion result disagrees with the result determined for the complete request after natural language understanding, the completion is considered failed, and the original processing flow proceeds normally.
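A minimal sketch of that final comparison, assuming an `nlu` callable that maps a request text to its understood result; returning None stands for falling back to the original processing flow.

```python
def on_complete_request(final_text, completion_result, second_instruction, nlu):
    # Completion succeeds if it matches the final request textually or
    # after natural language understanding.
    if completion_result == final_text or nlu(completion_result) == nlu(final_text):
        return second_instruction   # submit the pre-computed instruction
    return None                     # failure: normal flow handles final_text
```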
Thus, an incomplete user voice request can be completed based on the prefix tree to obtain a completion result; the subsequently received complete request is compared with the completion result, and when they are the same, the voice interaction can be completed according to the second prediction instruction corresponding to the completion result.
Referring to fig. 12, step 02 includes:
0291: completing the user voice request obtained in real time based on the prefix tree to obtain a completion result when the complete user voice request has not been received;
0292: predicting on the basis of the completion result according to the recurrent neural network model to obtain the prediction result.

Referring to fig. 2, steps 0291 and 0292 may be implemented by the prediction module 12. That is, the prediction module 12 is configured to complete the user voice request obtained in real time based on the prefix tree to obtain a completion result when the complete request has not been received, and to predict on the basis of the completion result according to the recurrent neural network model to obtain the prediction result.

Specifically, when the complete user voice request has not been received, the request obtained in real time is first completed based on the prefix tree to obtain completion results, which are then scored by the recurrent neural network model to obtain the prediction result.

Therefore, the incomplete user voice request can first be completed by the personalized model, and the completion results can then be fed into the recurrent network built from long short-term memory units to be scored and ranked into a prediction result, so that the two models are organically fused and a more accurate prediction result is obtained.
It should be noted that, for ranking preference among the completion results, the ranking can be biased by the current vehicle information, such as the current time (morning/noon/evening, weekday/weekend, season), the vehicle state (starting/driving/stopped), the previous round's user voice request, or the large-screen state (navigation interface/music interface).

That is, the voice interaction method of the invention selectively screens the completion results toward the intended target; the difference between the completions screened this way and those a generic text-completion scheme would generate is what makes the completion scheme unique to this specific field.
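A minimal sketch of such context-biased ranking; the context keys mirror the vehicle signals named above, while the scoring rules and weights are invented purely for illustration.

```python
def rank_completions(completions, context):
    def score(text):
        s = 0.0
        if context.get("screen") == "music" and "play" in text:
            s += 1.0   # favor music requests on the music interface
        if context.get("vehicle_state") == "driving" and "navigate" in text:
            s += 1.0   # favor navigation requests while driving
        return s
    return sorted(completions, key=score, reverse=True)

rank_completions(["play my playlist", "navigate home"],
                 {"screen": "music", "vehicle_state": "parked"})
```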
In addition, the voice interaction method organically fuses the global model and the personalized model; the scheme is highly operable, places low demands on the equipment, is easy to implement, and its effect is evident.
Referring to fig. 13, the voice interaction method includes:
05: completing the user voice request obtained in real time based on the prefix tree to obtain a completion result when the complete user voice request has not been received;
06: processing the completion result to obtain a third prediction instruction;
07: after the complete user voice request is received, completing the voice interaction according to the corresponding first or third prediction instruction if the prediction result or the completion result is the same as the received complete request.
Referring to fig. 14, the voice interaction device 10 further includes a complement module 15.
Step 05 may be implemented by the complement module 15, step 06 may be implemented by the instruction generation module 13, and step 07 may be implemented by the comparison module 14. That is, the complementing module 15 is configured to complement the user voice request acquired in real time based on the prefix tree to obtain a complementing result when the complete user voice request is not received; the instruction generating module 13 is configured to process the complement result to obtain a third prediction instruction; the comparison module 14 is configured to complete the voice interaction according to the corresponding first prediction instruction or third prediction instruction if the prediction result or the completion result is the same as the received complete user voice request after the complete user voice request is received.
Specifically, the received incomplete user voice request is processed by the global model and the personalized model simultaneously to obtain a prediction result and a completion result, which are then processed respectively to obtain a first prediction instruction and a third prediction instruction.
Then, after the complete user voice request is received, "if the prediction result or the completion result is the same as the received complete user voice request, completing the voice interaction according to the corresponding first prediction instruction or third prediction instruction" means:
if the prediction result is the same as the received complete user voice request, the voice interaction is completed according to the corresponding first prediction instruction; if the completion result is the same as the received complete user voice request, the voice interaction is completed according to the corresponding third prediction instruction; and if both the prediction result and the completion result are the same as the received complete user voice request, the voice interaction is completed according to either the first or the third prediction instruction.
In this way, the incomplete user voice request is handled by the global model and the personalized model in parallel, so a prediction instruction can already be prepared before the complete request arrives, and the voice interaction is completed sooner.
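As a minimal illustration of this selection logic, the following sketch (all names are hypothetical, not taken from the patent) shows how the two pre-computed instructions could be reconciled once the complete request arrives:

```python
from typing import Any, Optional

def finish_interaction(complete_request: str,
                       prediction: Optional[str], first_instruction: Any,
                       completion: Optional[str], third_instruction: Any) -> Any:
    """Pick the pre-computed instruction whose guess matches the full request."""
    if prediction == complete_request:
        return first_instruction    # global RNN model predicted correctly
    if completion == complete_request:
        return third_instruction    # personalized prefix tree completed correctly
    return None                     # neither matched: process the full request normally
```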
Referring to fig. 15, the voice interaction method includes:
0251: determining a completion condition by data analysis;
0252: and in the case that the user voice request acquired in real time meets the completion condition, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result.
Referring to fig. 2, steps 0251 and 0252 may be implemented by the prediction module 12. That is, the prediction module 12 is configured to determine the completion condition through data analysis, and to complete the user voice request acquired in real time based on the prefix tree to obtain a completion result when the completion condition is met.
Specifically, the completion condition is determined by data analysis. For example, the completion condition may be that the number of recognized words falls within the range of 2 to 10, a range determined from online data analysis.
When the user voice request acquired in real time meets the completion condition, it is completed based on the prefix tree to obtain a completion result. That is, once the number of words in the user voice request reaches any value from 2 to 10, the incomplete request can be completed based on the prefix tree; the word-count range can of course be adjusted to actual needs.
For example, referring to fig. 8, when the number of words in the user voice request acquired in real time is 2 and the request is "day no", it may be completed based on the prefix tree in fig. 8 to obtain the completion result "day no drop".
Therefore, for a received incomplete user voice request, the voice interaction method can build instruction prediction closely tied to the individual based on the prefix tree and automatically complete utterances strongly associated with that individual, achieving a "thousand people, thousand faces" effect in which each user's personalized utterances are recognized.
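A minimal Python sketch of such a personalized prefix tree follows; the class layout, the weight field, and the 2-10 window check are illustrative assumptions rather than the patent's implementation (for Chinese requests the character count stands in for the word count):

```python
class PrefixTree:
    """Character-level trie over a user's historical voice requests."""

    def __init__(self):
        self.children = {}     # next character -> PrefixTree
        self.is_end = False    # a full historical request ends at this node
        self.weight = 0.0      # usage-based weight (see the later sketches)

    def insert(self, request, weight=1.0):
        node = self
        for ch in request:
            node = node.children.setdefault(ch, PrefixTree())
        node.is_end = True
        node.weight = max(node.weight, weight)

    def complete(self, prefix):
        """Return the highest-weighted historical request extending `prefix`."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return None    # unseen prefix: no completion possible
            node = node.children[ch]
        best = None
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.is_end and (best is None or cur.weight > best[1]):
                best = (text, cur.weight)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return best[0] if best else None

def try_complete(tree, partial, lo=2, hi=10):
    """Apply the completion condition (length 2-10) before searching the tree."""
    return tree.complete(partial) if lo <= len(partial) <= hi else None
```

For instance, after tree.insert("I want to listen to jazz"), try_complete(tree, "I want") would return the stored request, while a one-character input would be rejected by the completion condition.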
Referring to fig. 16, step 0252 includes:
02521: in the case that the user voice request acquired in real time meets the completion condition, determining the voice type of the user voice request according to the user voice request acquired in real time;
02522: and selecting a corresponding prefix tree according to the voice type to complete the user voice request acquired in real time to obtain a completion result.
Referring to fig. 2, steps 02521 and 02522 may be implemented by the prediction module 12. That is, the prediction module 12 is configured to determine the voice type of the user voice request according to the user voice request acquired in real time when the completion condition is met, and to select the corresponding prefix tree according to the voice type to complete the request and obtain a completion result.
Specifically, the personalized voice types of a user voice request may include the following 4 types: navigation-type voice, music-type voice, phone-type voice, and high-frequency-type voice.
The sentence prefix tree corresponding to each of the 4 voice types may be configured, for example, as follows:
Navigation class: (I want to go) (#POI#) (nearby | around). That is, sentence prefixes of the navigation class may be "I want to go", "I want to go nearby", "I want to go around", "I want to go to #POI#", or "I want to go around #POI#", where #POI# is a point-of-interest slot.
Music class: (I want to listen | I want to play | I want to put on | I want to search) (slot value). That is, sentence prefixes of the music class may be "I want to listen", "I want to play", "I want to put on", "I want to search", "I want to listen to that", "I want to listen to [slot value]", or "I want to listen to songs of [slot value]".
Phone class: (help me | for me | I want to) call (#designated# store | experience store). That is, sentence prefixes of the phone class may be "help me call", "call for me", "I want to call", "call the #designated# store", or "help me make a phone call to the #designated# store".
High-frequency class: prefixes of statements that have occurred frequently over a recent period, e.g. "the first one" or "turn on the air conditioner".
For example, when real-time voice recognition of the acquired user voice data yields the user voice request "I want to listen", the request can be classified as the music voice type. The prefix tree corresponding to the music type is then selected to automatically complete the request "I want to listen", so that the completion result may be "I want to listen to the playlist of [slot value]".
Therefore, the voice interaction method can identify the personalized voice type of the user voice request acquired in real time and automatically complete the request using the prefix tree corresponding to that type, again achieving the "thousand people, thousand faces" effect of recognizing different users' personalized utterances.
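One way to sketch this routing in Python follows; the regular-expression patterns and type names are invented English stand-ins for the patent's Chinese templates, and PrefixTree refers to the earlier sketch:

```python
import re

# Hypothetical stand-ins for the per-type sentence templates listed above.
TYPE_PATTERNS = {
    "navigation": re.compile(r"^i want to go"),
    "music":      re.compile(r"^i want to (listen|play|search)"),
    "phone":      re.compile(r"^(help me call|i want to call)"),
}

def detect_type(request: str) -> str:
    """Classify a (partial) request into one of the four personalized types."""
    lowered = request.lower()
    for voice_type, pattern in TYPE_PATTERNS.items():
        if pattern.match(lowered):
            return voice_type
    return "high_frequency"   # fallback: recently frequent utterances

def complete_by_type(trees: dict, partial: str):
    """Route the partial request to the prefix tree of its voice type."""
    tree = trees.get(detect_type(partial))
    return tree.complete(partial) if tree else None
```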
Referring to fig. 17, the voice interaction method includes:
02523: adding the complete user voice request to the prefix tree after the received complete user voice request has been processed to obtain a user voice instruction.
Referring to fig. 2, step 02523 may be implemented by the prediction module 12. That is, the prediction module 12 is configured to add the complete user voice request to the prefix tree after the received complete request has been processed to obtain the user voice instruction.
Specifically, when the received complete user voice request itself has to be processed to obtain the user voice instruction, this indicates that both the prefix-tree-based personalized model and the recurrent-neural-network-based global model failed to complete or predict the incomplete request, i.e., the completion result and the prediction result both differed from the complete user voice request; only then is the complete request, once received, processed into a user voice instruction to finish the voice interaction.
In that case, after the received complete user voice request has been processed to obtain the user voice instruction, the complete request may be added to the prefix tree, realizing a real-time update of the prefix tree.
Thus, the voice interaction method updates the prefix tree by recording every statement, keeping the prefix tree current.
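A sketch of this write-back step, reusing the hypothetical helpers from the earlier sketches (detect_type, PrefixTree) plus an assumed last-used timestamp table:

```python
import time

def on_interaction_finished(trees: dict, timestamps: dict, complete_request: str):
    """After the full request was processed normally, learn it for next time."""
    voice_type = detect_type(complete_request)
    trees.setdefault(voice_type, PrefixTree()).insert(complete_request)
    timestamps[complete_request] = time.time()   # consulted by the forgetting sketch
```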
Referring to fig. 18, the voice interaction method includes:
001: acquiring historical user voice requests within a preset time period;
002: and constructing a prefix tree according to the historical user voice requests.
Referring to fig. 19, the voice interaction device 10 includes a prefix tree construction module 111.
Steps 001 and 002 may be implemented by the prefix tree construction module 111. That is, the prefix tree construction module 111 is configured to obtain the historical user voice requests within a preset time period and to construct a prefix tree from them. Steps 001 and 002 may occur before step 01 or step 02, without limitation herein.
Specifically, historical user voice requests may first be recorded by day (or another unit of time). The historical requests within a preset time period are then obtained, where the preset time period may range from 7 to 30 days before the current time; obtaining only the requests of the preceding 7 to 30 days keeps the prefix tree built from them current.
Accordingly, from the historical user voice requests of the preceding 7 to 30 days, each person's historical requests can be sorted by policy matching into the four types (music, navigation, phone, and high frequency), and an initial prefix tree is then constructed for each type.
In this way, an initial prefix tree can be constructed from the historical user voice requests within the preset time period, laying the foundation for the subsequent prefix-tree search that completes user voice requests.
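Building on the earlier sketches, the initial per-type construction could look like this (detect_type and PrefixTree are the hypothetical helpers defined above, and load_requests is an assumed accessor for the recorded daily logs):

```python
def build_initial_trees(history):
    """Build one prefix tree per voice type from 7-30 days of history."""
    trees = {}
    for request in history:
        voice_type = detect_type(request)   # stand-in for the policy matching
        trees.setdefault(voice_type, PrefixTree()).insert(request)
    return trees

# Example usage (load_requests is hypothetical):
# trees = build_initial_trees(load_requests(days=30))
```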
Referring to fig. 20, the voice interaction method includes:
003: setting a forgetting duration for the user voice requests in the prefix tree;
004: and deleting a user voice request from the prefix tree when that request has gone unused for the forgetting duration.
Referring to fig. 19, steps 003 and 004 may be implemented by the prefix tree construction module 111. That is, the prefix tree construction module 111 is configured to set a forgetting duration for the user voice requests in the prefix tree, and to delete a request from the prefix tree when it has gone unused for the forgetting duration. Steps 003 and 004 may occur before step 01 or step 02 and after step 002.
Specifically, the forgetting duration may be, for example, 24 hours, 48 hours, 3 days, 5 days, 7 days, 10 days, 11 days, 12 days, or 30 days, without limitation; it can also be set by the user according to their needs.
Thus, by setting a forgetting duration, historical voice requests that have not been used for a long time are removed from the prefix tree, which reduces storage cost and keeps the prefix tree current.
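A simple sketch of forgetting, using the last-used timestamps recorded in the write-back sketch; rebuilding the trees keeps the example short, whereas a production trie would delete nodes in place:

```python
import time

def forget_and_rebuild(timestamps: dict, forget_seconds: float = 7 * 86400):
    """Drop requests unused for longer than the forgetting duration."""
    now = time.time()
    fresh = {req: t for req, t in timestamps.items()
             if now - t <= forget_seconds}    # keep only recently used requests
    timestamps.clear()
    timestamps.update(fresh)
    return build_initial_trees(list(fresh))   # reuse the construction sketch
```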
Referring to fig. 21, the voice interaction method includes:
005: counting the use frequency, request length, and/or request share of the historical user voice requests;
006: and determining the weights of the corresponding user voice requests in the prefix tree according to the use frequency, request length, and/or request share.
Referring to fig. 19, steps 005 and 006 may be implemented by the prefix tree construction module 111. That is, the prefix tree construction module 111 is configured to count the use frequency, request length, and/or request share (proportion) of the historical user voice requests, and to determine the weights of the corresponding requests in the prefix tree accordingly.
Specifically, the weight of a user voice request in the prefix tree may be determined from one or more of these statistics: how often the request is used, how long it is, and what share of all requests it accounts for.
In other words, statistical information about the historical voice requests, such as frequency, length, and share, can be used to reorder the user voice requests or to change their weights.
In this way, the weight of a historical voice request that was last used long ago or is used infrequently can be reduced, keeping the prefix tree current.
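The exact combination of frequency, length, and share is not specified; one illustrative weighting (a pure assumption) might be:

```python
from collections import Counter

def request_weight(freq: int, length: int, share: float) -> float:
    """Favor frequent, short, high-share requests (one possible heuristic)."""
    return freq * share / max(length, 1)

def build_weighted_tree(history):
    counts = Counter(history)
    total = sum(counts.values())
    tree = PrefixTree()                       # from the earlier sketch
    for request, freq in counts.items():
        weight = request_weight(freq, len(request), freq / total)
        tree.insert(request, weight=weight)   # higher weight wins in complete()
    return tree
```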
The training method for the recurrent neural network model of the present invention will now be described in detail.
Referring to fig. 22, the present invention provides a model training method for training the recurrent neural network model used in the voice interaction method of any of the above embodiments. The model training method comprises the following steps:
22: acquiring training data;
24: constructing a recurrent neural network model;
26: training the recurrent neural network with the training data based on a Bayesian method to obtain a trained recurrent neural network model.
Referring to fig. 23, the present invention further provides a model training apparatus 20. The model training apparatus 20 includes a second acquisition module 21, a model construction module 22, and a model training module 23.
Step 22 may be implemented by the second acquisition module 21, step 24 by the model construction module 22, and step 26 by the model training module 23. That is, the second acquisition module 21 is configured to acquire training data; the model construction module 22 is configured to construct the recurrent neural network model; and the model training module 23 is configured to train the recurrent neural network with the training data based on a Bayesian method to obtain the trained model.
Specifically, training data is first acquired. The training data may be training user voice requests obtained by running automatic speech recognition over voice data freely uttered by users; for example, a training user voice request may be "turn on the air conditioner and then navigate home", "open the rearview mirror", and the like.
Then, a recurrent neural network model is constructed based on the acquired training user voice requests. That is, the model training method of the invention builds the initial framework of the recurrent neural network model on a long short-term memory network, which is lightweight yet effective.
Finally, the recurrent neural network is trained with the training data based on a Bayesian method to obtain the trained model. Compared with manual tuning, grid search, and random search, Bayesian-method-based tuning is more principled: it builds a probability model of the objective function and uses that model to select the most promising hyperparameters for evaluation against the real objective function.
The real objective function is evaluated as follows: (1) build a probability model that substitutes for the objective function, i.e. a surrogate function; (2) find the hyperparameters that score best under the surrogate; (3) train the machine learning model with those hyperparameters to obtain the true objective score; (4) update the surrogate's prior distribution over (x, y) with that score; then repeat steps (2)-(4) until the maximum number of iterations or the maximum duration is reached.
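The patent names no particular tool for this loop. As one concrete illustration, the TPE sampler in the optuna library implements a Bayesian-style surrogate search; the LSTM search space and the train_and_score stub below are assumptions made only for the sketch:

```python
import optuna

def train_and_score(hidden_size: int, lr: float) -> float:
    # Stand-in for the real objective: train the LSTM with these
    # hyperparameters and return the validation loss. The synthetic
    # formula here exists only to make the sketch runnable.
    return (lr - 0.01) ** 2 + abs(hidden_size - 256) / 1000

def objective(trial: optuna.Trial) -> float:
    hidden_size = trial.suggest_int("hidden_size", 64, 512)
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    return train_and_score(hidden_size, lr)

study = optuna.create_study(direction="minimize")  # surrogate + prior updates
study.optimize(objective, n_trials=50)             # stop at the trial budget
print(study.best_params)
```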
Therefore, the invention performs voice interaction based on a recurrent neural network model trained by a Bayesian method, which can predict conventional sentences accurately and quickly while keeping sentence richness, parameter count, time cost, and monetary cost under control.
Referring to fig. 24, step 22 includes:
221: acquiring a plurality of historical user voice requests;
222: splicing the plurality of historical user voice requests with a preset character to obtain a request character string;
223: and segmenting the request character string according to the maximum processing length to obtain the training data.
Referring to fig. 23, steps 221, 222, and 223 may be implemented by the model construction module 22. That is, the model construction module 22 is configured to acquire a plurality of historical user voice requests, splice them with a preset character to obtain a request character string, and segment the request character string according to the maximum processing length to obtain the training data.
Specifically, a plurality of historical user voice requests is first acquired, for example the n historical user voice requests from the 7 days preceding the current time: Query_1, Query_2, Query_3, ..., Query_n.
Then, all the queries are spliced with the preset character [EOS] to obtain the request character string: Query_1 + [EOS] + Query_2 + [EOS] + ... + Query_n.
Assuming the total length of the request character string is N and the maximum processing length is L, the training data is obtained by segmenting the request character string according to L.
In this way, training data can be obtained based on the maximum processing length.
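In Python, the splicing and segmentation might be sketched as follows (the [EOS] literal and function names are illustrative):

```python
EOS = "[EOS]"

def build_request_string(queries):
    """Query_1 + [EOS] + Query_2 + [EOS] + ... + Query_n."""
    return EOS.join(queries)

def segment(request_string: str, max_len: int):
    """Cut the long string into consecutive chunks of at most max_len."""
    return [request_string[i:i + max_len]
            for i in range(0, len(request_string), max_len)]
```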
Referring to fig. 25, step 223 includes:
2231: traversing the request character string and intercepting, from the current character, a segment of the maximum processing length to obtain training input data;
2232: and intercepting, from the character after the current character, a segment of the maximum processing length to obtain training result data.
Referring to fig. 23, steps 2231 and 2232 may be implemented by the model construction module 22. That is, the model construction module 22 is configured to traverse the request character string, intercepting a segment of the maximum processing length from the current character to obtain training input data, and a segment of the maximum processing length from the following character to obtain training result data.
Specifically, consider again the spliced request character string Query_1 + [EOS] + Query_2 + [EOS] + ... + Query_n.
The request character string may be traversed so that, for each position i, a window of the maximum processing length L is intercepted starting at character i to obtain the training input data x = [i, i+L-1], and the window starting one character later is intercepted to obtain the training result data y = [i+1, i+L], for i = 1, ..., N-1.
That is, the training data consists of training input data and training result data; during training, the input x = [i, i+L-1] and its label y = [i+1, i+L] are fed into the recurrent neural network model, which then outputs the training result, as shown in fig. 26.
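A sketch of this shifted-window pairing (indices here are 0-based, while the description above counts from 1):

```python
def make_training_pairs(request_string: str, max_len: int):
    """For each position i: input = chars [i, i+L-1], label = chars [i+1, i+L]."""
    pairs = []
    for i in range(len(request_string) - max_len):
        x = request_string[i:i + max_len]           # training input data
        y = request_string[i + 1:i + max_len + 1]   # training result data, shifted by one
        pairs.append((x, y))
    return pairs
```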
Thus, the training input data and training result data can be used to train the recurrent neural network model, yielding the trained model.
Referring to fig. 27, the present invention further provides an electronic device 30. The electronic device 30 comprises a processor 31 and a memory 32; the memory 32 stores a computer program 321 which, when executed by the processor 31, implements the voice interaction method described in any of the embodiments above. The electronic device 30 includes, but is not limited to, a vehicle, a mobile phone, a tablet such as an iPad, and the like.
Therefore, the voice interaction method applied to the electronic device 30 of the invention predicts and completes the user voice request based on the recurrent neural network model, reducing the total time the dialogue system needs to process a user voice request; the adopted lightweight model can predict conventional sentences accurately and quickly while keeping richness, parameter count, time cost, and monetary cost under control.
Referring to fig. 28, the present invention also provides a non-transitory computer-readable storage medium 40 containing a computer program 41. When the computer program 41 is executed by one or more processors 50, the voice interaction method described in any of the embodiments above is implemented.
For example, the computer program 41, when executed by the processor 50, implements the steps of the following voice interaction method:
01: acquiring user voice data to perform voice recognition in real time to acquire a user voice request;
02: in the case that the complete user voice request has not been received, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result;
03: processing the completion result to obtain a predicted voice instruction;
04: after the complete user voice request is received, if the completion result is the same as the received complete user voice request, completing the voice interaction according to the predicted voice instruction.
It will be appreciated that the computer program 41 comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable storage medium may include any entity or device capable of carrying computer program code: a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and so on.
The voice interaction method applied via the computer-readable storage medium 40 of the invention predicts and completes the user voice request based on the recurrent neural network model, reducing the total time the dialogue system needs to process a user voice request; the adopted lightweight model can predict conventional sentences accurately and quickly while keeping richness, parameter count, time cost, and monetary cost under control.

Claims (18)

1. A method of voice interaction, comprising:
acquiring user voice data to perform voice recognition in real time to acquire a user voice request;
in the case that the complete user voice request has not been received, providing the user voice request acquired in real time as a current input to a recurrent neural network model for prediction to obtain a predicted character;
in the case that the predicted character is not a preset character, splicing the predicted character and the current input to obtain a next input, and providing the next input to the recurrent neural network model for prediction again;
in the case that the predicted character is a preset character, taking the current input as a prediction result of the user voice request;
processing the prediction result to obtain a first prediction instruction;
and after the complete user voice request is received, if the prediction result is the same as the received complete user voice request, completing the voice interaction according to the first prediction instruction.
2. The voice interaction method according to claim 1, wherein, in the case that the predicted character is not a preset character, the splicing the predicted character and the current input to obtain a next input and providing the next input to the recurrent neural network model for prediction again comprises:
acquiring a confidence of the predicted character and an entropy of the prediction probability distribution;
and in the case that the confidence is greater than a first threshold and the entropy of the prediction probability distribution is less than a second threshold, splicing the predicted character and the current input to obtain the next input and providing it to the recurrent neural network model for prediction again.
3. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
determining that the prediction fails in the case that the number of characters input to the recurrent neural network model is greater than the maximum predicted number of characters.
4. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
in the case that the complete user voice request has not been received, completing the user voice request acquired in real time based on a prefix tree;
and in the case that no completion result is obtained from the prefix-tree-based completion, predicting the user voice request according to the user voice request acquired in real time and the recurrent neural network model to obtain the prediction result.
5. The voice interaction method according to claim 4, wherein the voice interaction method comprises:
in the case that the completion result is obtained from the prefix-tree-based completion, processing the completion result to obtain a second prediction instruction;
and after the complete user voice request is received, if the completion result is the same as the received complete user voice request, completing the voice interaction according to the second prediction instruction.
6. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
in the case that the complete user voice request has not been received, completing the user voice request acquired in real time based on a prefix tree to obtain a completion result;
and predicting from the completion result according to the recurrent neural network model to obtain the prediction result.
7. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
in the case that the complete user voice request has not been received, completing the user voice request acquired in real time based on a prefix tree to obtain a completion result;
processing the completion result to obtain a third prediction instruction;
and after the complete user voice request is received, if the prediction result or the completion result is the same as the received complete user voice request, completing the voice interaction according to the corresponding first prediction instruction or third prediction instruction.
8. The voice interaction method according to any one of claims 4-7, wherein the voice interaction method comprises:
determining a completion condition by data analysis;
and in the case that the user voice request acquired in real time meets the completion condition, completing the user voice request acquired in real time based on the prefix tree to obtain the completion result.
9. The voice interaction method according to claim 8, wherein the completing the user voice request acquired in real time based on the prefix tree to obtain the completion result, in the case that the user voice request acquired in real time meets the completion condition, comprises:
in the case that the user voice request acquired in real time meets the completion condition, determining the voice type of the user voice request according to the user voice request acquired in real time;
and selecting the corresponding prefix tree according to the voice type to complete the user voice request acquired in real time and obtain the completion result.
10. The voice interaction method according to claim 8, wherein the voice interaction method comprises:
and after the received complete user voice request is processed to obtain a user voice instruction, adding the complete user voice request to the prefix tree.
11. The voice interaction method according to claim 8, wherein the voice interaction method comprises:
acquiring a historical user voice request in a preset time period;
and constructing the prefix tree according to the historical user voice request.
12. The voice interaction method according to claim 11, wherein the voice interaction method comprises:
setting a forgetting duration for the user voice requests in the prefix tree;
and deleting the corresponding user voice request from the prefix tree in the case that the duration for which it has gone unused reaches the forgetting duration.
13. The voice interaction method according to claim 11, wherein the voice interaction method comprises:
counting the use frequency, request length, and/or request share of the historical user voice requests;
and determining the weights of the corresponding user voice requests in the prefix tree according to the use frequency, request length, and/or request share.
14. A model training method for training the recurrent neural network model used in the voice interaction method of any one of claims 1-13, comprising:
acquiring training data;
constructing the recurrent neural network model;
and training the recurrent neural network with the training data based on a Bayesian method to obtain the trained recurrent neural network model.
15. The model training method of claim 14, wherein the acquiring training data comprises:
acquiring a plurality of historical user voice requests;
splicing the plurality of historical user voice requests with a preset character to obtain a request character string;
and segmenting the request character string according to the maximum processing length to obtain the training data.
16. The model training method of claim 15, wherein the segmenting the request character string according to the maximum processing length to obtain the training data comprises:
traversing the request character string and intercepting, from the current character, a segment of the maximum processing length to obtain training input data;
and intercepting, from the character after the current character, a segment of the maximum processing length to obtain training result data.
17. An electronic device comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the voice interaction method of any of claims 1-13.
18. A non-transitory computer readable storage medium containing a computer program, characterized in that the voice interaction method of any of claims 1-13 is implemented when the computer program is executed by one or more processors.
CN202210378175.0A 2022-04-12 2022-04-12 Voice interaction method, model training method, electronic device and storage medium Active CN114822533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210378175.0A CN114822533B (en) 2022-04-12 2022-04-12 Voice interaction method, model training method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114822533A (en) 2022-07-29
CN114822533B (en) 2023-05-12

Family

ID=82534048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210378175.0A Active CN114822533B (en) 2022-04-12 2022-04-12 Voice interaction method, model training method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114822533B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083413B (en) * 2022-08-17 2022-12-13 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium
CN116110396B (en) * 2023-04-07 2023-08-29 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021164244A1 (en) * 2020-02-18 2021-08-26 百度在线网络技术(北京)有限公司 Voice interaction method and apparatus, device and computer storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0905457D0 (en) * 2009-03-30 2009-05-13 Touchtype Ltd System and method for inputting text into electronic devices
US8996550B2 (en) * 2009-06-03 2015-03-31 Google Inc. Autocompletion for partially entered query
US11061948B2 (en) * 2016-09-22 2021-07-13 Verizon Media Inc. Method and system for next word prediction
KR20210016767A (en) * 2019-08-05 2021-02-17 삼성전자주식회사 Voice recognizing method and voice recognizing appratus
CN113946719A (en) * 2020-07-15 2022-01-18 华为技术有限公司 Word completion method and device
CN113571064B (en) * 2021-07-07 2024-01-30 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN113284496B (en) * 2021-07-22 2021-10-12 广州小鹏汽车科技有限公司 Voice control method, voice control system, vehicle, server, and storage medium


Also Published As

Publication number Publication date
CN114822533A (en) 2022-07-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant