CN114822533A - Voice interaction method, model training method, electronic device and storage medium


Info

Publication number
CN114822533A
Authority
CN
China
Prior art keywords
user voice
request
voice request
prediction
neural network
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210378175.0A
Other languages
Chinese (zh)
Other versions
CN114822533B (en)
Inventor
李万水
陈光毅
翁志伟
孙仿逊
李晨延
赵耀
易晖
李嘉辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202210378175.0A
Publication of CN114822533A
Application granted
Publication of CN114822533B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0631 - Creating reference templates; Clustering
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a voice interaction method, a model training method, an electronic device, and a storage medium. The voice interaction method comprises: acquiring user voice data and performing voice recognition on it in real time to obtain a user voice request; when the complete user voice request has not yet been received, predicting the user voice request from the partial request acquired in real time using a recurrent neural network model to obtain a prediction result; processing the prediction result to obtain a first prediction instruction; and, after the complete user voice request is received, completing the voice interaction according to the first prediction instruction if the prediction result is the same as the received complete request. By predicting and completing the user voice request with a recurrent neural network model, the invention reduces the total time the dialogue system needs to process a user voice request; because the model is lightweight, common utterances can be predicted accurately and quickly, and the richness of outputs, the parameter count, the latency, and the cost remain controllable.

Description

Voice interaction method, model training method, electronic device and storage medium
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a voice interaction method, a model training method, an electronic device, and a storage medium.
Background
Text completion appears in many related services: input-method suggestions, search-box query completion, code completion, automatic text generation, and so on. One approach is a conventional retrieval algorithm, but this requires storing a large amount of data in advance and searching a huge database, trading space for time to improve efficiency and accuracy. Another approach is the pre-trained text generation models that have become popular in recent years; although their accuracy is high and their outputs are diverse, such models have enormous parameter counts, require long training on domain-specific text, are slow at generation time, demand powerful hardware, and are therefore costly in both time and money.
In a speech recognition scenario, however, the goal is to gain time by predicting the user's intention within the user's normal speaking time, without disturbing normal use of the system. This requires a model that generates quickly and accurately while keeping throughput low; the model therefore need not be large, diversity is not pursued, and the storage footprint should be as small as possible to save cost.
Disclosure of Invention
The embodiment of the invention provides a voice interaction method, a model training method, electronic equipment and a storage medium.
The embodiment of the invention provides a voice interaction method. The voice interaction method comprises: acquiring user voice data and performing voice recognition on it in real time to obtain a user voice request; when the complete user voice request has not yet been received, predicting the user voice request from the partial request acquired in real time using a recurrent neural network model to obtain a prediction result; processing the prediction result to obtain a first prediction instruction; and, after the complete user voice request is received, completing the voice interaction according to the first prediction instruction if the prediction result is the same as the received complete user voice request.
In this way, the voice interaction method of the invention predicts and completes the user voice request with a recurrent neural network model, reducing the total time the dialogue system needs to process the request; because the model is lightweight, common utterances can be predicted accurately and quickly, and the richness of outputs, the parameter count, the latency, and the cost remain controllable.
Predicting the user voice request from the partial request acquired in real time using the recurrent neural network model, when the complete user voice request has not yet been received, comprises: providing the user voice request acquired in real time as the current input to the recurrent neural network model to obtain a predicted character; when the predicted character is not the preset character, splicing the predicted character onto the current input to form the next input and providing the next input to the recurrent neural network model for the next prediction; and, when the predicted character is the preset character, taking the current input as the prediction result.
In this way, an incomplete user voice request can be extended character by character with the recurrent neural network model until a prediction result is obtained.
Splicing the predicted character onto the current input and predicting again, when the predicted character is not the preset character, comprises: obtaining the confidence of the predicted character and the entropy of the prediction probability distribution; and, when the confidence is greater than a first threshold and the entropy of the prediction probability distribution is less than a second threshold, splicing the predicted character onto the current input to form the next input and providing the next input to the recurrent neural network model for the next prediction.
The invention thus imposes a double limit: the confidence of each predicted character must be above the first threshold and the entropy of its prediction distribution must be below the second threshold, and a character is accepted only when both conditions hold at once. As a result, few low-probability results survive, and the throughput is kept under control.
The voice interaction method comprises: determining that the prediction fails when the number of characters input into the recurrent neural network model is greater than the maximum predicted character number.
This bounds the work spent on any one request, so the total time the dialogue system needs to process the user voice request is still reduced while the prediction efficiency of the recurrent neural network model is improved.
Predicting the user voice request when the complete user voice request has not yet been received further comprises: completing the user voice request acquired in real time based on a prefix tree; and, when the prefix-tree completion yields no completion result, predicting the user voice request from the partial request acquired in real time using the recurrent neural network model to obtain the prediction result.
In this way, the voice interaction method can obtain a completion result in time through either the personalized model or the global model.
The voice interaction method comprises: when the prefix-tree completion yields a completion result, processing the completion result to obtain a second prediction instruction; and, after the complete user voice request is received, completing the voice interaction according to the second prediction instruction if the completion result is the same as the received complete user voice request.
In this way, an incomplete user voice request can be completed based on the prefix tree, the subsequently received complete request can be compared with the completion result, and, when the two are the same, the voice interaction can be completed according to the second prediction instruction derived from the completion result.
Predicting the user voice request when the complete user voice request has not yet been received may also comprise: completing the user voice request acquired in real time based on a prefix tree to obtain a completion result; and predicting from the completion result with the recurrent neural network model to obtain the prediction result.
In this way, the incomplete user voice request is first completed by the personalized model, and the completion result is then scored and ranked by the global model built on a Long Short-Term Memory (LSTM) network, organically fusing the two models to obtain a more accurate prediction result.
The voice interaction method comprises: when the complete user voice request has not yet been received, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result; processing the completion result to obtain a third prediction instruction; and, after the complete user voice request is received, completing the voice interaction according to the corresponding first prediction instruction or third prediction instruction if the prediction result or the completion result, respectively, is the same as the received complete user voice request.
In this way, the incomplete user voice request is processed by the global model and the personalized model in parallel to obtain a prediction result and a completion result, which are processed into the first prediction instruction and the third prediction instruction respectively, so that a usable prediction instruction can be obtained from the incomplete request and the voice interaction completed.
The voice interaction method comprises: determining a completion condition through data analysis; and, when the user voice request acquired in real time meets the completion condition, completing the request acquired in real time based on the prefix tree to obtain a completion result.
In this way, for a received incomplete user voice request the invention performs instruction prediction closely tied to the individual based on the prefix tree, automatically completing expressions strongly associated with that user to obtain the completion result; the personalized expressions of different users can thus each be recognized, achieving per-user personalization.
Completing the user voice request acquired in real time based on the prefix tree when it meets the completion condition comprises: determining, when the completion condition is met, the voice type of the user voice request from the request acquired in real time; and selecting the prefix tree corresponding to that voice type to complete the request acquired in real time and obtain the completion result.
In this way, the voice interaction method identifies which personalized voice type the user voice request belongs to from the request acquired in real time, and completes the request with the prefix tree corresponding to that type, again recognizing the personalized expressions of different users.
The voice interaction method comprises: after the complete user voice request has been received and a user voice instruction obtained, adding the complete user voice request to the prefix tree.
In this way, the prefix tree is updated with the utterances recorded each day, keeping it current.
The voice interaction method comprises: acquiring historical user voice requests within a preset time period; and constructing the prefix tree from those historical requests.
In this way, an initial prefix tree can be built from the historical user voice requests of the preset time period, laying the foundation for later searching the tree to complete user voice requests.
The voice interaction method comprises: setting a forgetting duration for the user voice requests in the prefix tree; and deleting a user voice request from the prefix tree when the time for which it has gone unused reaches the forgetting duration.
In this way, setting a forgetting duration removes long-stale historical voice requests from the prefix tree, reducing storage cost while keeping the tree current.
The voice interaction method comprises: counting the use frequency, request length and/or request proportion of the historical user voice requests; and determining the weight of each user voice request in the prefix tree from that use frequency, request length and/or request proportion.
In this way, the weight of historical voice requests that are old and rarely used can be lowered, keeping the prefix tree current. A sketch of this maintenance is given below.
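The two maintenance steps above, the forgetting duration and the statistic-based weighting, might look like the following minimal sketch. The concrete forgetting duration, the per-request statistics layout, and the weighting formula are illustrative assumptions, since the publication does not fix them.

```python
import time

FORGET_SECONDS = 90 * 24 * 3600   # assumed forgetting duration of 90 days

def maintain_prefix_tree(entries, now=None):
    """entries maps a stored user voice request to its usage statistics:
    {'uses': int, 'last_used': epoch seconds}. Stale requests are deleted;
    the rest are weighted by use frequency, request length and proportion."""
    now = time.time() if now is None else now
    total_uses = sum(e["uses"] for e in entries.values()) or 1
    for request in list(entries):
        e = entries[request]
        if now - e["last_used"] >= FORGET_SECONDS:   # forgetting duration reached
            del entries[request]                     # delete from the prefix tree
            continue
        proportion = e["uses"] / total_uses          # request proportion
        # Illustrative combination of the three statistics named above.
        e["weight"] = e["uses"] * proportion / max(len(request), 1)
    return entries
```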
The invention also provides a model training method for training the recurrent neural network model used by the voice interaction method of any of the above embodiments. The model training method comprises: acquiring training data; constructing the recurrent neural network model; and training the recurrent neural network model with the training data based on a Bayesian method to obtain the trained model.
In this way, voice interaction is carried out with a recurrent neural network model trained by a Bayesian method; common utterances can be predicted accurately and quickly, and the richness of outputs, the parameter count, the latency, and the cost remain controllable.
The acquiring of the training data comprises: acquiring a plurality of historical user voice requests; splicing the historical user voice requests together with a preset character to obtain a request character string; and segmenting the request character string by the maximum processing length to obtain the training data.
In this way, training data can be derived from the maximum processing length.
Segmenting the request character string by the maximum processing length to obtain the training data comprises: traversing the request character string and, starting from the current character, intercepting a span of the maximum processing length to obtain the training input data; and, starting from the character after the current character, intercepting a span of the maximum processing length to obtain the training result data.
In this way, the recurrent neural network model can be trained on pairs of training input data and training result data to obtain the trained model, as sketched below.
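The segmentation above amounts to the standard sliding-window construction for a character-level language model, sketched below. The separator character and the maximum processing length are assumed values for illustration.

```python
SEP = "\n"       # stands in for the preset splicing character
MAX_LEN = 32     # assumed maximum processing length

def build_training_data(historical_requests):
    """Splice the historical user voice requests into one request character
    string, then slide a window of the maximum processing length over it:
    the training result data is the training input data shifted by one
    character, i.e. a next-character prediction objective."""
    text = SEP.join(historical_requests) + SEP
    inputs, targets = [], []
    for i in range(len(text) - MAX_LEN):
        inputs.append(text[i : i + MAX_LEN])            # training input data
        targets.append(text[i + 1 : i + 1 + MAX_LEN])   # training result data
    return inputs, targets
```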
The invention also provides an electronic device. The electronic device comprises a processor and a memory storing a computer program which, when executed by the processor, implements the voice interaction method of any of the above embodiments.
In this way, the electronic device predicts and completes the user voice request with the recurrent neural network model, reducing the total time the dialogue system needs to process the request; because the model is lightweight, common utterances can be predicted accurately and quickly, and the richness of outputs, the parameter count, the latency, and the cost remain controllable.
The present invention also provides a non-transitory computer-readable storage medium containing a computer program. When executed by one or more processors, the computer program implements the voice interaction method of any of the above embodiments.
In this way, the storage medium likewise predicts and completes the user voice request with the recurrent neural network model, reducing the total time the dialogue system needs to process the request while keeping the richness of outputs, the parameter count, the latency, and the cost controllable.
Additional aspects and advantages of embodiments of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 2 is a schematic diagram of the structure of the voice interaction device of the present invention;
FIG. 3 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 4 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 5 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 6 is a schematic view of a scenario of a prediction process of a prediction result of the voice interaction method of the present invention;
FIG. 7 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 8 is a diagram illustrating a prefix tree structure in the voice interaction method of the present invention;
FIG. 9 is a flow diagram of a processing mechanism of the prior art streaming ASR technology framework;
FIG. 10 is a flow diagram of the processing mechanism of the ASR technology framework of the speech interaction method of the present invention;
FIG. 11 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 12 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 13 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 14 is a schematic structural diagram of a voice interaction device of the present invention;
FIG. 15 is a flow chart diagram of the voice interaction method of the present invention;
FIG. 16 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 17 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 18 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 19 is a flow chart of the voice interaction device of the present invention;
FIG. 20 is a flow chart diagram of the voice interaction method of the present invention;
FIG. 21 is a flow chart diagram of a voice interaction method of the present invention;
FIG. 22 is a schematic flow chart diagram of a model training method of the present invention;
FIG. 23 is a schematic view of the structure of the model training apparatus of the present invention;
FIG. 24 is a schematic flow chart diagram of a model training method of the present invention;
FIG. 25 is a schematic flow chart diagram of a model training method of the present invention;
FIG. 26 is a schematic diagram of a training process for training a model in the model training method of the present invention;
FIG. 27 is a schematic diagram of the structure of the electronic device of the present invention;
FIG. 28 is a schematic structural diagram of a computer-readable storage medium of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for explaining the embodiments of the present invention, and are not construed as limiting the embodiments of the present invention.
Referring to fig. 1, the present invention provides a voice interaction method. The voice interaction method comprises the following steps:
01: acquiring user voice data to perform voice recognition in real time to obtain a user voice request;
02: under the condition that a complete user voice request is not received, predicting the user voice request according to the user voice request acquired in real time and the recurrent neural network model to obtain a prediction result;
03: processing the prediction result to obtain a first prediction instruction;
04: and after receiving the complete user voice request, if the prediction result is the same as the received complete user voice request, completing voice interaction according to the first prediction instruction.
Referring to fig. 2, the present invention further provides a voice interaction apparatus 10. The voice interaction apparatus 10 includes: a first obtaining module 11, a predicting module 12, an instruction generating module 13 and a comparing module 14.
Step 01 may be implemented by the first obtaining module 11, step 02 may be implemented by the predicting module 12, step 03 may be implemented by the instruction generating module 13, and step 04 may be implemented by the comparing module 14. That is, the first obtaining module 11 is configured to obtain the user voice data to perform voice recognition in real time to obtain the user voice request; the prediction module 12 is configured to, when a complete user voice request is not received, predict the user voice request according to the user voice request acquired in real time and the recurrent neural network model to obtain a prediction result; the instruction generating module 13 is configured to process the prediction result to obtain a first prediction instruction; the comparing module 14 is configured to, after receiving the complete user voice request, complete voice interaction according to the first prediction instruction if the prediction result is the same as the received complete user voice request.
Specifically, user voice data is first acquired and voice recognition is performed on it in real time to obtain the user voice request. The user voice data is the audio stream directly input by the user; real-time voice recognition is performed on it with Automatic Speech Recognition (ASR) technology, and the recognized text is the user voice request. It will be appreciated that the goal of automatic speech recognition is to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences. That is, by acquiring the audio stream directly input by the user and recognizing it automatically, a computer-readable user voice request is obtained.
Then, when the complete user voice request has not yet been received, the user voice request is predicted from the partial request acquired in real time using the recurrent neural network model to obtain a prediction result. A lightweight Long Short-Term Memory network (LSTM) is selected to build the global recurrent neural network (Seq2Seq) model, so that common utterances can be predicted accurately and quickly while the richness of outputs, the parameter count, the latency, and the cost remain controllable.
The prediction result can then be processed to obtain a first prediction instruction and, after the complete user voice request is received, the voice interaction is completed according to the first prediction instruction if the prediction result is the same as the received complete request. That is, before the automatic speech recognition returns the complete user voice request, the voice interaction method of the present application predicts the full request from the partial one with the recurrent neural network model and sends the prediction result ahead of time to the subsequent modules, namely a Natural Language Understanding (NLU) module, a Dialog Management (DM) module, and an automatic program (BOT) module, which process it into the first prediction instruction. Once the automatic speech recognition returns the complete user voice request, if the prediction result is the same as the complete request, no further processing of the complete request is needed, and the total time the dialogue system needs to process the user voice request is effectively reduced.
In this way, the voice interaction method of the invention predicts and completes the user voice request with the recurrent neural network model, reducing the total time the dialogue system needs to process the request; because the model is lightweight, common utterances can be predicted accurately and quickly, and the richness of outputs, the parameter count, the latency, and the cost remain controllable.
In step 02, the user voice request acquired in real time may be the voice text returned by the automatic speech recognition, and the recurrent neural network model predicts the request whenever that text contains at least 2 and fewer than 10 characters. For example, prediction can be triggered as soon as 2 characters have been acquired, yielding the prediction result for a 2-character input; each time the automatic speech recognition returns one more character, prediction runs once more, so prediction results are obtained for inputs of different lengths and are processed into a plurality of first prediction instructions. If any of these prediction results is the same as the user voice request finally returned by the automatic speech recognition, the voice interaction can be completed directly according to the corresponding first prediction instruction. Of course, when the complete user voice request has not been received, deciding whether to predict from the partial request need not be limited to the manner discussed above and can be varied according to actual needs.
It should be noted that the automatic speech recognition can decide, through a wait-timeout mechanism after a period of continued listening, whether the complete user voice request has been received, ensuring the integrity of the user's utterance as far as possible. The overall flow is sketched below.
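The following is a minimal sketch of the early-prediction flow just described, under stated assumptions: `rnn_predict` and `nlu_dm_bot` are hypothetical stand-ins for the recurrent-neural-network predictor and the downstream NLU/DM/BOT pipeline, while the 2-10 character trigger window is taken from the description.

```python
# Minimal sketch of the early-prediction flow; rnn_predict and nlu_dm_bot
# are hypothetical stand-ins, not the patented implementation.

def rnn_predict(partial_text):
    """Stand-in for the recurrent neural network predictor (see step 02)."""
    return None  # would return a predicted complete request, or None

def nlu_dm_bot(request):
    """Stand-in for the downstream NLU -> DM -> BOT processing."""
    return {"instruction": request}

def on_partial_result(partial_text, cache):
    """Called each time streaming ASR broadcasts a partial recognition result."""
    if 2 <= len(partial_text) < 10:               # trigger window from the text
        prediction = rnn_predict(partial_text)
        if prediction is not None and prediction not in cache:
            cache[prediction] = nlu_dm_bot(prediction)  # precompute instruction

def on_complete_result(complete_text, cache):
    """Called when the ASR wait-timeout fires and the full request is known."""
    if complete_text in cache:          # a prediction matched: reuse its
        return cache[complete_text]     # precomputed first prediction instruction
    return nlu_dm_bot(complete_text)    # otherwise process normally
```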
Referring to fig. 3, step 02 includes:
021: providing a user voice request acquired in real time as current input to a recurrent neural network model for prediction to obtain a predicted character;
022: under the condition that the predicted character is not a preset character, splicing the predicted character and the current input to obtain the next input, and providing the next input to the recurrent neural network model for prediction again;
023: in the case where the predicted character is a preset character, the current input is taken as a prediction result.
Referring to fig. 2, steps 021, 022, 023 may be implemented by prediction module 12. That is, the prediction module 12 is configured to provide the user voice request obtained in real time as the current input to the recurrent neural network model for prediction to obtain a predicted character; under the condition that the predicted character is not a preset character, splicing the predicted character and the current input to obtain the next input, and providing the next input to the recurrent neural network model for prediction again; in the case where the predicted character is a preset character, the current input is taken as a prediction result.
Specifically, under the condition that a complete user voice request is not received, predicting the user voice request according to the user voice request acquired in real time and the recurrent neural network model to obtain a prediction result comprises the following steps:
Firstly, the user voice request acquired in real time is provided as the current input to the recurrent neural network model, which predicts the next character. Note that the example of fig. 3 operates on Chinese characters: the request rendered in English as "turn on the air conditioner" is the four-character string "打开空调". If the user voice request acquired in real time is "turn on" (打开), then "打开" is provided as the current input and, as shown in fig. 3, the predicted character is "空", the first character of "air conditioner" (空调).
Then, when the predicted character is not the preset character, the predicted character is spliced onto the current input to form the next input, which is provided to the recurrent neural network model for the next prediction; when the predicted character is the preset character, the current input is taken as the prediction result. In fig. 3, the current input "打开" yields the predicted character "空". The preset character is "[EOS]", so "空" is not the preset character; it is spliced onto the current input to form the next input "打开空", which is provided to the model and yields the predicted character "调". Since "调" is still not "[EOS]", it is spliced on again to form the next input "打开空调" ("turn on the air conditioner"), which is provided to the model once more and yields "[EOS]". The predicted character is now the preset character, so the current input "打开空调" is taken as the prediction result.
Therefore, the incomplete user voice request can be predicted according to the recurrent neural network model to obtain a prediction result.
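The character-by-character loop of steps 021-023, together with the failure condition of step 024 below, can be sketched as follows. The `step` interface and the toy lookup table are assumptions for illustration, reproducing the fig. 3 example.

```python
EOS = "[EOS]"          # the preset character
MAX_PRED_CHARS = 10    # the maximum predicted character number (step 024)

class ToyModel:
    """Hypothetical stand-in reproducing the fig. 3 example."""
    TABLE = {"打开": "空", "打开空": "调", "打开空调": EOS}

    def step(self, text):
        return self.TABLE.get(text, EOS)  # next predicted character

def predict_request(partial_request, model):
    current = partial_request                       # current input (step 021)
    while True:
        if len(current) > MAX_PRED_CHARS:           # step 024: prediction fails
            return None
        predicted = model.step(current)
        if predicted == EOS:                        # preset character reached:
            return current                          # current input is the result
        current = current + predicted               # splice and predict again

print(predict_request("打开", ToyModel()))           # -> 打开空调
```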
Referring to fig. 4, step 022 includes:
0221: obtaining the confidence coefficient of the predicted character and the entropy of the prediction probability distribution;
0222: and under the condition that the confidence coefficient is greater than a first threshold value and the entropy of the prediction probability distribution is less than a second threshold value, splicing the predicted characters and the current input to obtain the next input, and providing the next input to the recurrent neural network for prediction again.
Referring to fig. 2, steps 0221 and 0222 may be implemented by prediction module 12. That is, the prediction module 12 is configured to obtain the confidence of the predicted character and the entropy of the prediction probability distribution; and under the condition that the confidence coefficient is greater than a first threshold value and the entropy of the prediction probability distribution is less than a second threshold value, splicing the predicted characters and the current input to obtain the next input, and providing the next input to the recurrent neural network for prediction again.
Specifically, the confidence of a predicted character measures how likely the predicted character is the character the user actually intends to express. The entropy of the prediction probability distribution is computed over the distribution predicted by the long short-term memory network and indicates how concentrated that prediction is: the smaller the entropy, the closer the predicted character is to the character the user actually intends.
The confidence of the predicted character and the entropy of the prediction probability distribution are obtained first. Then, when the confidence is greater than the first threshold and the entropy is less than the second threshold, that is, when the prediction is both likely to be correct and sharply concentrated, the predicted character is spliced onto the current input to form the next input, which is provided to the recurrent neural network model for the next prediction.
For example, the first threshold may be 60%, 65%, 68%, 72%, 75%, 79%, 80%, 82%, 85%, and 90%, and the second threshold may be 0, 0.1, 0.11, 0.15, 0.2, 0.21, 0.22, 0.23, 0.24, and 0.25.
It is understood that if the entropy of the long short-term memory network's prediction probability distribution is higher than the second threshold, the current round of prediction is stopped.
The invention thus imposes a double limit: the confidence of each predicted character must be above the first threshold and the entropy of its prediction distribution must be below the second threshold, and a character is accepted only when both conditions hold at once. As a result, few low-probability results survive, and the throughput is kept under control. A sketch of this gate follows.
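A sketch of the double limit, under the assumption that confidence is read as the top probability of the model's output distribution and entropy as the Shannon entropy of that distribution (the publication does not fix either formula):

```python
import math

FIRST_THRESHOLD = 0.8    # example value; the text lists candidates from 60% to 90%
SECOND_THRESHOLD = 0.25  # example value; the text lists candidates from 0 to 0.25

def accept_character(prob_dist):
    """Accept a predicted character only if the confidence exceeds the first
    threshold AND the entropy of the distribution stays below the second."""
    confidence = max(prob_dist)                               # top probability
    entropy = -sum(p * math.log(p) for p in prob_dist if p > 0)
    return confidence > FIRST_THRESHOLD and entropy < SECOND_THRESHOLD

print(accept_character([0.95, 0.04, 0.01]))  # True: confident and concentrated
print(accept_character([0.4, 0.3, 0.3]))     # False: diffuse prediction
```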
Referring to fig. 5, the voice interaction method includes:
024: and determining that the prediction fails under the condition that the number of characters input by the recurrent neural network model is greater than the maximum predicted number of characters.
Referring to fig. 2, step 024 may be implemented by prediction module 12. That is, the prediction module 12 is configured to determine that the prediction has failed if the number of characters input by the recurrent neural network model is greater than the maximum predicted number of characters.
It can be understood that the more characters must be predicted, the more time is consumed, and once the input grows beyond a certain number of characters further prediction is no longer worthwhile. The invention therefore sets a maximum predicted character number, ensuring that the total time the dialogue system needs to process the user voice request is reduced as far as possible while the prediction efficiency is improved.
Specifically, the maximum predicted character number of the present invention may be 10 characters or another value; it is a value determined through repeated experiments.
When the maximum predicted character number is 10 characters and the number of characters input into the recurrent neural network model exceeds 10, the input is no longer predicted, that is, the prediction is determined to have failed.
During prediction, characters are predicted one at a time; if "[EOS]" has still not been predicted when the maximum predicted character number is reached, the prediction result is not passed downstream.
It should be noted that the voice interaction method of the present invention may use a beam search algorithm rather than a greedy algorithm or an exhaustive method to search for the next character. Beam search enlarges the search space relative to greedy decoding yet remains far smaller than the exponential search space of exhaustive search; it is a compromise between the two, as sketched below.
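A beam search over next characters might look like the following sketch; the `topk` interface returning (character, probability) pairs is an assumption for illustration.

```python
import heapq
import math

def beam_search(model, prefix, beam_width=3, max_steps=10, eos="[EOS]"):
    """Keep only the beam_width best partial sequences at each step: wider
    than a greedy search (width 1), far narrower than exhaustive search."""
    beams = [(0.0, prefix)]                 # (cumulative -log prob, text)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for cost, text in beams:
            for ch, p in model.topk(text, beam_width):  # assumed interface
                if ch == eos:
                    finished.append((cost, text))       # sequence complete
                else:
                    candidates.append((cost - math.log(p), text + ch))
        if not candidates:
            break
        beams = heapq.nsmallest(beam_width, candidates)  # prune to the beam
    best = min(finished + beams)            # lowest cost = highest probability
    return best[1]
```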
In this way, the total time the dialogue system needs to process the user voice request is reduced while the prediction efficiency of the recurrent neural network model is improved.
Referring to fig. 7, step 02 includes:
025: under the condition that a complete user voice request is not received, completing the user voice request acquired in real time based on the prefix tree;
026: and, when the prefix-tree completion yields no completion result, predicting the user voice request from the request acquired in real time using the recurrent neural network model to obtain the prediction result.
Referring to fig. 2, steps 025 and 026 can be implemented by the prediction module 12. That is, the prediction module 12 is configured to complete the user voice request acquired in real time based on the prefix tree when the complete user voice request has not been received, and, when that completion yields no result, to predict the user voice request from the request acquired in real time using the recurrent neural network model to obtain the prediction result.
It can be understood that the global model mainly reflects the usage habits of users as a whole. To cover each user's own habits while preserving accuracy, speed, and throughput, the sentences peculiar to an individual can be listed separately and handled by a personalized model. The invention therefore completes the user voice request by fusing a global model with a personalized model. The global model is trained on the corpus of voice requests from the full user base and mainly predicts globally common, valid instructions. The personalized model is built on the corpus of an individual user's voice requests and handles instruction prediction for navigation, music, telephone, other high-frequency requests, and anything else closely tied to the individual. The global model of the invention is the recurrent neural network model built on the long short-term memory network as described above; the personalized model is built on a prefix tree algorithm.
Referring to table 1, personalized user voice requests fall into two types. One is closed: real-time recognition of the user voice data yields a fixed request such as "open vehicle status". The other is open: real-time recognition yields a request with an open slot, such as "navigate to the Yota Holiday Square".
TABLE 1 (reproduced as an image in the original publication; it lists examples of the closed and open personalized user voice request types described above)
Specifically, user voice data is first acquired and voice recognition is performed in real time to obtain the user voice request, so that personalized requests of the kinds above can be recognized and different users' personalized voice needs met. The user voice data is the audio stream directly input by the user, on which real-time recognition is performed with automatic speech recognition technology; the recognized text is the user voice request. It will be appreciated that the goal of automatic speech recognition is to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences.
Then, when the complete user voice request has not yet been received, the user voice request acquired in real time is completed based on the prefix tree to obtain a completion result. That is, while the request is being acquired in real time and is still incomplete, once the partial request is recognized as matching either type of personalized request, the matching sentence can be looked up in the prefix tree to complete it, and the completion result can be processed into the predicted voice instruction.
The prefix tree algorithm is simple to implement: by means of the tree structure, each sentence prefix serves as a node under which sentences are stored, so that sentences can be looked up quickly and conveniently. An example prefix tree is shown in fig. 8, whose nodes are the characters rendered in this translation as "days" and "songs".
It can be understood that, since the prefix tree is used to complete the user voice requests acquired in real time, the invention updates the prefix tree in real time with the utterances the user inputs each day, ensuring that the tree stays current. A minimal sketch of such a prefix tree follows.
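A minimal character-level prefix tree, as an illustration rather than the patented data structure:

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next character -> child node
        self.is_end = False     # True if a stored sentence ends here

class PrefixTree:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, sentence):
        """Store a sentence character by character, as in fig. 8."""
        node = self.root
        for ch in sentence:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def complete(self, prefix):
        """Return every stored sentence that extends the given prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []                     # no completion result
            node = node.children[ch]
        results = []
        def collect(n, suffix):
            if n.is_end:
                results.append(prefix + suffix)
            for ch, child in n.children.items():
                collect(child, suffix + ch)
        collect(node, "")
        return results
```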
After the complete user voice request is received, if the completion result is the same as the received complete request, the voice interaction is completed according to the predicted voice instruction. That is, when the complete user voice request arrives, the completion result is compared with it; if the two sentences are literally identical, or differ in wording but carry the same meaning, the completion result is considered the same as the received complete request.
In this way, when the completion result matches the received complete request, the voice interaction can be completed directly from the predicted voice instruction. The user's intention is thus exploited ahead of time, within the user's normal speaking time, accelerating the voice interaction at the level of latency while the predicted voice instruction is generated quickly and accurately.
It can be understood that, referring to fig. 9, the processing mechanism of the current streaming ASR technical framework recognizes the user's incoming speech frame by frame and returns the recognized result in real time, with a final wait-timeout mechanism of continued listening to ensure the integrity of the utterance as far as possible. As shown in fig. 9, the time available to this mechanism runs from when the user voice request starts arriving until the ASR timeout ends, and the maximum recoverable gain is the time the Natural Language Understanding (NLU) module, the Dialog Management (DM) module, and the automatic program (BOT) module would spend on processing, that is, the time the original NLU + DM + BOT pipeline takes on the voice request.
Referring to fig. 10, before the ASR returns the complete sentence, the voice interaction method of the present invention completes the partial user voice request based on the prefix tree and sends the completed request to the subsequent modules for processing ahead of time, which effectively reduces the total time the dialogue system needs to process the user instruction. That is, the method accurately predicts the user's full expression from a low-completeness voice instruction, and the prediction is considered correct when its semantics are identical to those of the user's complete voice request.
In addition, when the prefix-tree completion yields no completion result, the user voice request is predicted from the partial request acquired in real time using the recurrent neural network model, so a prediction result is still obtained in time even when the prefix tree cannot complete the request.
That is, the personalized model is considered first for completing the incomplete user voice request, and the global model takes over when no completion result can be obtained from the personalized model.
Therefore, the voice interaction method can obtain a completion result in time through the personalized model and the global model.
Referring to fig. 11, the voice interaction method includes:
027: under the condition that a completion result is obtained by performing completion based on the prefix tree, processing the completion result to obtain a second prediction instruction;
028: and after receiving the complete user voice request, if the completion result is the same as the received complete user voice request, completing voice interaction according to the second prediction instruction.
Referring to fig. 2, steps 027 and 028 can be implemented by prediction module 12. That is, the prediction module 12 is configured to, when the prefix tree is used to perform completion to obtain a completion result, process the completion result to obtain a second prediction instruction; and after receiving the complete user voice request, if the completion result is the same as the received complete user voice request, completing voice interaction according to the second prediction instruction.
Specifically, the user voice request is sent to the server frame by frame; the streaming ASR on the server recognizes it frame by frame and broadcasts the recognition result in real time. Each time the electronic device receives a broadcast result, it may perform the following steps: (1) condition interception: the character count is limited to 2-10; (2) the incomplete text is sent to the completion module for completion and, if a completion result is output, it is forwarded to the Natural Language Understanding (NLU) module, the Dialogue Management (DM) module, and the automatic program (BOT) module for processing into the second prediction instruction.
Finally, when the streaming ASR module receives the complete user voice request, the completion result is compared with the complete request; if the two are judged consistent after natural language understanding, the completion is deemed successful, and the second prediction instruction obtained from the completion result is submitted directly to carry out the voice interaction.
If the completion result is judged inconsistent with the complete request after natural language understanding, the completion is deemed failed, and processing proceeds normally along the original flow.
In this way, an incomplete user voice request can be completed based on the prefix tree, the subsequently received complete request compared with the completion result, and, when the two are the same, the voice interaction completed according to the second prediction instruction corresponding to the completion result.
Referring to fig. 12, step 02 includes:
0291: under the condition that a complete user voice request is not received, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result;
0292: and predicting the completion result according to the recurrent neural network model to obtain a prediction result.
Referring to FIG. 2, steps 0291 and 0292 can be implemented by prediction module 12. That is, the prediction module 12 is configured to, in a case that a complete user voice request is not received, perform completion on the user voice request obtained in real time based on the prefix tree to obtain a completion result; and predicting the completion result according to the recurrent neural network model to obtain a prediction result.
Specifically, when the complete user voice request has not yet been received, the user voice request acquired in real time is first completed based on the prefix tree to obtain the completion result. The completion result is then scored by the recurrent neural network model to obtain the prediction result.
In this way, the incomplete user voice request is completed by the personalized model, and the completion results are then fed into the model built on the long short-term memory network to be scored and ranked into the prediction result, organically fusing the two models and yielding a more accurate prediction.
It should be noted that, for ranking and selecting among completion results, the candidates may be sorted, screened, and preferred in combination with current vehicle information, such as the current time (morning or midnight, weekday or weekend, season), the vehicle state (starting, driving, or stopped), the previous round's user voice request, or the large-screen state (navigation interface or music interface).
That is, the voice interaction method of the invention screens the completion results with the target completion result in mind; the screened completions differ from those a generic text-completion scheme would produce, forming a completion scheme specific to this field.
In addition, the voice interaction method organically fuses the global model and the personalized model; the scheme is highly controllable, undemanding of hardware, easy to implement, and markedly effective. A sketch of such context-aware ranking follows.
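One way the context-aware ranking could look, as a sketch only: the scoring interface and the single context rule are illustrative assumptions, not the patented ranking logic.

```python
def rank_completions(candidates, lstm_score, context):
    """Score prefix-tree completion results with the global model and
    re-rank them using current vehicle information."""
    scored = []
    for text in candidates:
        score = lstm_score(text)    # assumed: global-model log-likelihood
        # Illustrative context rule: prefer navigation requests when the
        # large screen is showing the navigation interface.
        if context.get("screen") == "navigation" and "navigate" in text:
            score += 1.0
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]
```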
Referring to fig. 13, the voice interaction method includes:
05: under the condition that a complete user voice request is not received, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result;
06: processing the completion result to obtain a third prediction instruction;
07: and after receiving the complete user voice request, if the prediction result or the completion result is the same as the received complete user voice request, completing voice interaction according to the corresponding first prediction instruction or the third prediction instruction.
Referring to fig. 14, the voice interaction apparatus 10 further includes a completion module 15.
Step 05 may be implemented by the completion module 15, step 06 may be implemented by the instruction generation module 13, and step 07 may be implemented by the comparison module 14. That is to say, the completion module 15 is configured to, in a case where a complete user voice request is not received, complete the user voice request obtained in real time based on the prefix tree to obtain a completion result; the instruction generating module 13 is configured to process the completion result to obtain a third prediction instruction; the comparison module 14 is configured to, after receiving the complete user voice request, complete voice interaction according to the corresponding first prediction instruction or third prediction instruction if the prediction result or the completion result is the same as the received complete user voice request.
Specifically, the received incomplete user voice request may be processed through the global model and the personalized model simultaneously to obtain a prediction result and a completion result, and the prediction result and the completion result may be processed respectively to obtain the first prediction instruction and the third prediction instruction.
Then, after receiving a complete user voice request, if the prediction result or the completion result is the same as the received complete user voice request, completing voice interaction according to the corresponding first prediction instruction or the third prediction instruction means:
If the prediction result is the same as the received complete user voice request, voice interaction is completed according to the corresponding first prediction instruction. If the completion result is the same as the received complete user voice request, voice interaction is completed according to the corresponding third prediction instruction. If both the prediction result and the completion result are the same as the received complete user voice request, voice interaction is completed according to either the first prediction instruction or the third prediction instruction.
In this way, the received incomplete user voice request is processed by the global model and the personalized model simultaneously to obtain a prediction result and a completion result, and the two results are processed respectively to obtain the first prediction instruction and the third prediction instruction, so that a prediction instruction can be obtained according to the received incomplete user voice request and the voice interaction completed.
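A compact sketch of this dual-path flow is given below; the interfaces prefix_tree.complete, rnn_model.predict_sentence, and nlu.parse are hypothetical stand-ins for the personalized model, the global model, and instruction generation.

```python
# Hypothetical sketch of the dual-path flow; error handling omitted.

def on_partial_request(partial, prefix_tree, rnn_model, nlu):
    """Run both models on the incomplete request and pre-compute the
    corresponding prediction instructions."""
    completion = prefix_tree.complete(partial)        # personalized model
    prediction = rnn_model.predict_sentence(partial)  # global model
    third_instr = nlu.parse(completion) if completion else None
    first_instr = nlu.parse(prediction) if prediction else None
    return prediction, first_instr, completion, third_instr

def on_complete_request(full, prediction, first_instr, completion, third_instr, nlu):
    """Use a pre-computed instruction when either result matches the
    complete request; otherwise process the full request normally."""
    if prediction == full and first_instr is not None:
        return first_instr
    if completion == full and third_instr is not None:
        return third_instr
    return nlu.parse(full)
```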
Referring to fig. 15, the voice interaction method includes:
0251: determining completion conditions through data analysis;
0252: and under the condition that the user voice request acquired in real time meets the completion condition, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result.
Referring to FIG. 2, steps 0251 and 0252 can be implemented by prediction module 12. That is, the prediction module 12 is used to determine completion conditions through data analysis; and under the condition that the user voice request acquired in real time meets the completion condition, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result.
Specifically, the completion condition is first determined through data analysis. For example, the completion condition may be that the number of acquired words falls within the range of 2 to 10, a range determined from online data analysis.
Then, under the condition that the user voice request acquired in real time meets the completion condition, the request acquired in real time is completed based on the prefix tree to obtain a completion result. That is, when the word count of the user voice request obtained in real time reaches 2, 3, 4, 5, 6, 7, 8, 9, or 10, the incomplete user voice request may be completed based on the prefix tree to obtain a completion result; of course, the word-count range may be adjusted according to actual needs.
For example, referring to fig. 8, when the word count of the user voice request obtained in real time is 2 and the request is "japanese", the request may be completed based on the prefix tree in fig. 8, so that the completion result is "japanese-sunset".
Therefore, for a received incomplete user voice request, the voice interaction method can construct an instruction prediction closely associated with the individual based on the prefix tree and automatically complete the wording strongly associated with that individual, thereby obtaining a completion result and achieving personalized ("a thousand faces for a thousand people") recognition of different users' wording.
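As a minimal sketch of the condition check (assuming, per the example above, a 2-10 count window and a hypothetical prefix_tree.complete interface):

```python
# Word-count completion condition; the 2-10 range is taken from the text
# above and would in practice come from online data analysis.

MIN_LEN, MAX_LEN = 2, 10

def try_complete(partial_request, prefix_tree):
    n = len(partial_request)  # word/character count of the partial request
    if MIN_LEN <= n <= MAX_LEN:
        return prefix_tree.complete(partial_request)  # hypothetical API
    return None  # condition not met: keep listening
```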
Referring to fig. 16, step 0252 includes:
02521: determining the voice type of the user voice request according to the user voice request acquired in real time under the condition that the user voice request acquired in real time meets the completion condition;
02522: and selecting a corresponding prefix tree according to the voice type to complement the user voice request acquired in real time to obtain a complementing result.
Referring to FIG. 2, steps 02521 and 02522 may be implemented by prediction module 12. That is, the prediction module 12 is configured to determine the voice type of the user voice request according to the user voice request obtained in real time when the user voice request obtained in real time meets the completion condition; and selecting a corresponding prefix tree according to the voice type to complement the user voice request acquired in real time to obtain a complementing result.
Specifically, the personalized voice types of a user voice request may include the following four classes: navigation voice, music voice, telephone voice, and high-frequency voice.
The sentence-prefix tree corresponding to each of the four voice types may be formed, for example, as follows:
Navigation class: i (want) (to go) (around | nearby | # POI #) (# POI #). That is, navigation-class sentence prefixes may be "i want", "i want to go", or "i want to go around # POI #".
Music class: (i want to listen to the song list of | i want to listen to | play | search) [next | that] (slot value). That is, music-class sentence prefixes may be "i want to listen", "i want to play", "i want to search", "i want to listen to slot value", or "i want to listen to the song list of slot value".
Telephone class: (help me | for me | i want to) (# designated #) (store | experience store) (make a call). That is, telephone-class sentence prefixes may be "help me call", "call for me", or "help me make a call to the # designated # store".
High-frequency class: sentence prefixes that appear frequently over a period of time, e.g., "first" or "turn on air conditioner".
For example, when real-time voice recognition of the acquired user voice data yields the user voice request "i want to listen", the request can be determined to belong to the music voice type. According to the music type, the corresponding prefix tree is selected to automatically complete the request "i want to listen" acquired in real time, and the completion result may be "i want to listen to the song list of slot value".
Therefore, the voice interaction method can identify which personalized voice type a user voice request belongs to from the request acquired in real time, and automatically complete the request according to the prefix tree corresponding to that voice type, again achieving personalized recognition of different users' wording.
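One possible realization is sketched below, with hypothetical prefix lists standing in for the pattern grammars above; the patent does not specify the classifier, so this rule-based routing is an assumption.

```python
# Hypothetical rule-based routing: one prefix tree per voice type.

MUSIC_PREFIXES = ("i want to listen", "play", "search")
NAV_PREFIXES = ("i want to go", "navigate")
PHONE_PREFIXES = ("help me", "make a call", "call")

def classify_voice_type(partial):
    text = partial.lower()
    if text.startswith(MUSIC_PREFIXES):
        return "music"
    if text.startswith(NAV_PREFIXES):
        return "navigation"
    if text.startswith(PHONE_PREFIXES):
        return "telephone"
    return "high_frequency"  # frequent recent sentence prefixes

def complete_by_type(partial, trees):
    """trees: dict mapping voice type -> that user's prefix tree."""
    return trees[classify_voice_type(partial)].complete(partial)
```

For example, complete_by_type("i want to listen", trees) would route the request to the music prefix tree before attempting completion.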
Referring to fig. 17, the voice interaction method includes:
02523: and after the received complete user voice request is processed to obtain a user voice instruction, adding the complete user voice request to the prefix tree.
Referring to FIG. 2, step 02523 may be performed by prediction module 12. That is, the prediction module 12 is configured to add the complete user voice request to the prefix tree after processing the received complete user voice request to obtain the user voice instruction.
Specifically, when the received complete user voice request has to be processed to obtain the user voice instruction and complete the voice interaction, it indicates that the prefix-tree-based personalized model and the recurrent-neural-network-based global model both failed to complete or predict the incomplete request, or that their completion and prediction results differed from the complete request; therefore, after the complete user voice request is received, it is processed to obtain the user voice instruction and complete the voice interaction.
That is, after the complete user voice request is received and processed to obtain the user voice instruction, the complete user voice request can be added to the prefix tree, thereby updating the prefix tree in real time.
Therefore, the voice interaction method can update the prefix tree by recording the statements every day, and the real-time performance of the prefix tree is guaranteed.
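A minimal trie-insertion sketch follows; the node layout (children map, end flag, last-used timestamp) is an assumption, chosen so it also supports the forgetting mechanism described later.

```python
import time

class TrieNode:
    """Hypothetical prefix-tree node; the layout is illustrative only."""
    def __init__(self):
        self.children = {}
        self.is_end = False
        self.last_used = 0.0  # timestamp used by the forgetting mechanism

def insert(root, request):
    """Add a complete user voice request to the prefix tree."""
    node = root
    for ch in request:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True
    node.last_used = time.time()
```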
Referring to fig. 18, the voice interaction method includes:
001: acquiring a historical user voice request in a preset time period;
002: and constructing a prefix tree according to historical user voice requests.
Referring to fig. 19, the voice interaction apparatus 10 includes a prefix tree construction module 111.
Steps 001 and 002 may be implemented by prefix tree building module 111. That is, the prefix tree building module 111 is configured to obtain a historical user voice request within a preset time period; and constructing a prefix tree according to the historical voice requests of the users. Wherein, step 001 and step 002 may occur before step 01 or step 02, and are not limited herein.
Specifically, historical user voice requests may first be recorded by day (or another unit of time). Then, the historical user voice requests within a preset time period are obtained; the preset time period may range from 7 to 30 days before the current time, i.e., the historical user voice requests of the previous 7 to 30 days are obtained, which ensures the timeliness of the prefix tree constructed from those requests.
Thus, taking the user voice requests of the previous 7 to 30 days as a basis, each user's recorded historical voice requests can be matched by strategy into the four classes (music, navigation, telephone, and high frequency), and an initial prefix tree can be constructed for each class.
Therefore, the initial prefix tree can be constructed according to the historical user voice requests in the preset time period, and a foundation is laid for subsequently searching the prefix tree to complement the user voice requests.
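Building on the hypothetical TrieNode/insert and classify_voice_type sketches above, the initialization might look like this (the 7-day window is one value within the stated 7-30 day range):

```python
from datetime import datetime, timedelta

def build_initial_trees(history, days=7):
    """history: iterable of (datetime, request_text) pairs for one user.
    Returns one initial prefix tree per personalized voice type."""
    cutoff = datetime.now() - timedelta(days=days)
    types = ("navigation", "music", "telephone", "high_frequency")
    trees = {t: TrieNode() for t in types}
    for ts, text in history:
        if ts >= cutoff:  # keep only the preset time period
            insert(trees[classify_voice_type(text)], text)
    return trees
```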
Referring to fig. 20, the voice interaction method includes:
003: setting forget duration for a user voice request in a prefix tree;
004: and deleting the corresponding user voice request in the prefix tree under the condition that the unused time length of the user voice request in the prefix tree reaches the forgetting time length.
Referring to fig. 19, steps 003 and 004 can be implemented by prefix tree building module 111. That is, the prefix tree building module 111 is configured to set a forgetting duration for the user voice request in the prefix tree; and deleting the corresponding user voice request in the prefix tree under the condition that the unused time length of the user voice request in the prefix tree reaches the forgetting time length. Steps 003 and 004 may occur before step 01 or step 02 and after step 002.
Specifically, the forgetting period may be, for example, 24h, 48h, 3 days, 5 days, 7 days, 10 days, 11 days, 12 days, or 30 days, which is not limited herein. The forgetting duration can be set by the user according to the user requirement.
In this way, by setting the forgetting duration, historical voice requests from long ago are removed from the prefix tree, which reduces storage cost while keeping the prefix tree up to date.
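Using the hypothetical node layout sketched earlier, forgetting could be implemented as a periodic pruning pass:

```python
import time

def prune(node, forget_seconds):
    """Forget any stored request whose unused duration has reached the
    forgetting duration, and drop branches that become empty."""
    now = time.time()
    for ch in list(node.children):
        child = node.children[ch]
        prune(child, forget_seconds)
        if child.is_end and now - child.last_used >= forget_seconds:
            child.is_end = False  # the stored request is forgotten
        if not child.children and not child.is_end:
            del node.children[ch]  # remove the now-empty branch
```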
Referring to fig. 21, the voice interaction method includes:
005: counting the use frequency, the request length and/or the request proportion of the voice requests of historical users;
006: and determining the weight of the voice request of the corresponding user in the prefix tree according to the use frequency, the request length and/or the request proportion.
Referring to fig. 19, steps 005 and 006 can be implemented by prefix tree building module 111. That is, the prefix tree building module 111 is configured to count the usage frequency, the request length, and/or the request proportion of the voice request of the historical user; and determining the weight of the voice request of the corresponding user in the prefix tree according to the use frequency, the request length and/or the request proportion. Step 005 and step 006 may occur before step 01 or step 02 and after step 002.
Specifically, the usage frequency, request length, and/or request proportion of the historical user voice requests are counted, and the weight of the corresponding user voice request in the prefix tree is determined from one or more of these statistics.
In other words, the order of the user voice requests may be rearranged or the weight of the user voice requests may be changed using statistical information of the historical voice requests, such as frequency, length, percentage, and the like.
In this way, the weight of historical voice requests that have gone unused for a long time or are used at low frequency can be reduced, keeping the prefix tree current.
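The patent names the statistics but not a formula; one simple, purely illustrative combination might be:

```python
def request_weight(freq, length, proportion, w_f=1.0, w_p=1.0, w_l=0.1):
    """Illustrative weighting: frequent requests and requests with a high
    share of the user's traffic rise; very long, rarely used ones sink."""
    return w_f * freq + w_p * proportion - w_l * length
```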
The method for training the recurrent neural network model according to the present invention is described in detail below.
Referring to fig. 22, the present invention provides a model training method for training the recurrent neural network model used in the voice interaction method of any of the above embodiments. The model training method comprises the following steps:
22: acquiring training data;
24: constructing a recurrent neural network model;
26: based on a Bayesian method, the training data is used for training the recurrent neural network to obtain a trained recurrent neural network model.
Referring to fig. 23, the present invention further provides a model training device 20. The model training apparatus 20 includes a second obtaining module 21, a model building module 22, and a model training module 23.
Step 22 may be implemented by the second obtaining module 21, step 24 may be implemented by the model building module 22, and step 26 may be implemented by the model training module 23. That is, the second obtaining module 21 is configured to obtain training data; the model construction module 22 is used for constructing a recurrent neural network model; the model training module 23 is configured to train the recurrent neural network by using the training data based on a bayesian method to obtain a trained recurrent neural network model.
Specifically, training data is first obtained. The training data may be training user voice requests obtained by processing voice data freely spoken by users with automatic speech recognition, for example, "turn on the air conditioner and then navigate home" or "turn on the rear view mirror".
A recurrent neural network model is then constructed based on the acquired training user voice requests. That is, the model training method of the present invention builds the initial framework of the recurrent neural network model on a lightweight yet effective long short-term memory network.
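For illustration, a minimal character-level LSTM in PyTorch is sketched below; the layer sizes are assumptions, not values from the patent.

```python
import torch.nn as nn

class CharLSTM(nn.Module):
    """Lightweight character-level LSTM language model (illustrative)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embed(x), state)
        return self.head(out), state  # logits over the next character
```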
Finally, based on a Bayesian method, the recurrent neural network is trained with the training data to obtain the trained recurrent neural network model. Compared with manual tuning, grid search, and random search, the Bayesian method tunes parameters more systematically: it establishes a probability model of the objective function and uses that model to select the most promising hyperparameters for evaluation against the real objective function.
The process of evaluating the real objective function is as follows: (1) establish a probability model of the objective function, i.e., the surrogate function; (2) find the optimal hyperparameters that maximize the surrogate function; (3) train the machine learning model with those hyperparameters to obtain the score of the original objective function; (4) update the prior distribution (x, y) of the objective function with that score; (5) repeat steps (2)-(4) until the maximum number of iterations or the maximum duration is reached.
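As one concrete, non-authoritative way to realize this loop, scikit-optimize's Gaussian-process search can stand in for the surrogate-model procedure; the patent names no library, and train_and_score below is a dummy stand-in for actually training the model.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

search_space = [
    Integer(64, 512, name="hidden_dim"),
    Real(1e-4, 1e-2, prior="log-uniform", name="lr"),
]

def train_and_score(hidden_dim, lr):
    # Dummy stand-in so the sketch runs; in practice this would train the
    # recurrent neural network model and return its validation loss.
    return (lr - 1e-3) ** 2 + ((hidden_dim - 256) ** 2) * 1e-6

def objective(params):
    hidden_dim, lr = params
    return train_and_score(hidden_dim, lr)  # true objective score

# gp_minimize fits a probabilistic surrogate of the objective, picks the
# next hyperparameters with an acquisition function, evaluates the true
# objective, updates the surrogate, and repeats until n_calls is reached.
result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
```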
Therefore, the present invention performs voice interaction with a recurrent neural network model trained by the Bayesian method, which can accurately and quickly predict conventional sentences while keeping the richness of the sentences, the parameter count, the time consumption, and the cost controllable.
Referring to fig. 24, step 22 includes:
221: acquiring a plurality of historical user voice requests;
222: splicing a plurality of historical user voice requests through preset characters to obtain a request character string;
223: and segmenting the request character string according to the maximum processing length to obtain training data.
Referring to fig. 23, step 221, step 222, and step 223 can be implemented by the model building module 22. That is, the model building module 22 is configured to obtain a plurality of historical user voice requests; splice the plurality of historical user voice requests with a preset character to obtain a request character string; and segment the request character string according to the maximum processing length to obtain the training data.
Specifically, first, a plurality of historical user voice requests are acquired, for example, n historical user voice requests within the first 7 days of the current time are acquired: query _1, Query _2, Query _3 … Query _ n.
Then, all the Query are spliced by using a preset character "[ EOS ]" to obtain a request character string: query _1+ [ EOS ] + Query _2+ [ EOS ] +.. Query _ n.
Assuming the total length of the request character string is N, the maximum processing length L of the request character string is given, and the string is segmented according to L to obtain the training data.
In this manner, training data may be derived based on the maximum processing length.
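A small sketch of this pipeline follows; the "[EOS]" separator is taken from the text above, and the query list is a placeholder.

```python
EOS = "[EOS]"

def build_request_string(queries):
    """Splice Query_1 ... Query_n into one request character string."""
    return EOS.join(queries)

def segment(request_string, max_len):
    """Cut the request string into chunks of the maximum processing
    length L to obtain training data."""
    return [request_string[i:i + max_len]
            for i in range(0, len(request_string), max_len)]
```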
Referring to fig. 25, step 223 includes:
2231: traversing the request character string, and intercepting the request character string from the current character by the maximum processing length to obtain training input data;
2232: and intercepting the request character string from the next character of the current character by the maximum processing length to obtain training result data.
Referring to FIG. 23, steps 2231 and 2232 may be implemented by the model building module 22. That is, the model building module 22 is configured to traverse the request character string, intercepting the request character string from the current character with the maximum processing length to obtain training input data, and intercepting the request character string from the character after the current character with the maximum processing length to obtain training result data.
Specifically, for example, for the spliced request string: query _1+ [ EOS ] + Query _2+ [ EOS ] +.. Query _ n.
The request character string may be traversed: starting from the current character i, a segment of the maximum processing length is intercepted to obtain the training input data X = [i, i+L-1], and starting from the next character, a segment of the same length is intercepted to obtain the training result data Y = [i+1, i+L], for i = 1, ..., N-1.
That is, the training data includes training input data and training result data; if the input X = [i, i+L-1] and the corresponding label Y = [i+1, i+L], for i = 1, ..., N-1, are fed to the recurrent neural network model, the training result can be output from the model, as shown in fig. 26.
Therefore, the recurrent neural network model can be trained through training input data and training result data, and the trained recurrent neural network model is obtained.
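In 0-based Python indexing, the window construction above (X = [i, i+L-1], Y = [i+1, i+L]) reduces to a one-character shift, as this sketch shows:

```python
def make_training_pairs(s, L):
    """Slide a window of the maximum processing length over the request
    string: each label is the input shifted right by one character."""
    pairs = []
    for i in range(len(s) - L):
        x = s[i:i + L]          # training input data
        y = s[i + 1:i + L + 1]  # training result data (next characters)
        pairs.append((x, y))
    return pairs
```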
Referring to fig. 27, the present invention further provides an electronic device 30. The electronic device 30 comprises a processor 31 and a memory 32, the memory 32 having stored thereon a computer program 321 which, when executed by the processor 31, implements the voice interaction method of any of the embodiments described above. The electronic device 30 includes, but is not limited to, a vehicle, a mobile phone, a tablet computer such as an iPad, and the like.
In this way, the voice interaction method applied by the electronic device 30 of the present invention predicts and completes user voice requests based on the recurrent neural network model, reducing the total duration the dialog system needs to process a user voice request; at the same time, by adopting a lightweight model, it can accurately and quickly predict conventional sentences, and the richness of the sentences, the parameter count, the time consumption, and the cost remain controllable.
Referring to fig. 28, the present invention also provides a non-volatile computer-readable storage medium 40 containing a computer program 41. The computer program 41, when executed by one or more processors 50, implements the voice interaction method described in any of the embodiments above.
For example, the computer program 41, when executed by the processor 50, implements the steps of the following voice interaction method:
01: acquiring user voice data to perform voice recognition in real time to obtain a user voice request;
02: under the condition that a complete user voice request is not received, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result;
03: processing the completion result to obtain a predicted voice instruction;
04: and after receiving the complete user voice request, if the completion result is the same as the received complete user voice request, completing voice interaction according to the predicted voice instruction.
It will be appreciated that the computer program 41 comprises computer program code. The computer program code may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.
The voice interaction method applied by the computer-readable storage medium 40 of the present invention predicts and completes user voice requests based on the recurrent neural network model, reducing the total duration the dialog system needs to process a user voice request; at the same time, it adopts a lightweight model that can accurately and quickly predict conventional sentences, with the richness of the sentences, the parameter count, the time consumption, and the cost all controllable.

Claims (19)

1. A method of voice interaction, comprising:
acquiring user voice data to perform voice recognition in real time to obtain a user voice request;
under the condition that the complete user voice request is not received, predicting the user voice request according to the user voice request acquired in real time and a recurrent neural network model to obtain a prediction result;
processing the prediction result to obtain a first prediction instruction;
and after receiving the complete user voice request, if the prediction result is the same as the received complete user voice request, completing voice interaction according to the first prediction instruction.
2. The method of claim 1, wherein, under the condition that the complete user voice request is not received, predicting the user voice request according to the user voice request acquired in real time and a recurrent neural network model to obtain a prediction result comprises:
providing the user voice request acquired in real time as current input to the recurrent neural network model for prediction to obtain a predicted character;
under the condition that the predicted character is not a preset character, the predicted character and the current input are spliced to obtain a next input, and the next input is provided for the recurrent neural network model to predict again;
and taking the current input as the prediction result under the condition that the predicted character is a preset character.
3. The method of claim 2, wherein the step of splicing the predicted character and the current input to obtain a next input to be provided to the recurrent neural network for prediction again if the predicted character is not a preset character comprises:
obtaining the confidence coefficient of the predicted character and the entropy of the prediction probability distribution;
and under the condition that the confidence coefficient is greater than a first threshold value and the entropy of the prediction probability distribution is less than a second threshold value, splicing the predicted character and the current input to obtain a next input, and providing the next input to the recurrent neural network for prediction again.
4. The voice interaction method according to claim 2, wherein the voice interaction method comprises:
and determining that the prediction fails under the condition that the number of characters input to the recurrent neural network model is greater than the maximum predicted number of characters.
5. The method of claim 1, wherein, under the condition that the complete user voice request is not received, predicting the user voice request according to the user voice request acquired in real time and a recurrent neural network model to obtain a prediction result comprises:
under the condition that the complete user voice request is not received, completing the user voice request acquired in real time based on a prefix tree;
and under the condition that a completion result is not obtained by performing completion based on the prefix tree, predicting the user voice request according to the user voice request acquired in real time and the cyclic neural network model to obtain a prediction result.
6. The voice interaction method according to claim 5, wherein the voice interaction method comprises:
under the condition that the completion result is obtained by performing completion based on the prefix tree, processing the completion result to obtain a second prediction instruction;
and after receiving the complete user voice request, if the completion result is the same as the received complete user voice request, completing voice interaction according to the second prediction instruction.
7. The method of claim 1, wherein, under the condition that the complete user voice request is not received, predicting the user voice request according to the user voice request acquired in real time and a recurrent neural network model to obtain a prediction result comprises:
under the condition that the complete user voice request is not received, completing the user voice request acquired in real time based on a prefix tree to obtain a completion result;
and predicting the completion result according to the recurrent neural network model to obtain the prediction result.
8. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
under the condition that the complete user voice request is not received, completing the user voice request acquired in real time based on a prefix tree to obtain a completion result;
processing the completion result to obtain a third prediction instruction;
after receiving the complete user voice request, if the prediction result or the completion result is the same as the received complete user voice request, completing voice interaction according to the corresponding first prediction instruction or the third prediction instruction.
9. The voice interaction method according to any one of claims 5 to 8, wherein the voice interaction method comprises:
determining completion conditions through data analysis;
and under the condition that the user voice request acquired in real time meets the completion condition, completing the user voice request acquired in real time based on the prefix tree to obtain a completion result.
10. The method according to claim 9, wherein the completing the user voice request obtained in real time based on the prefix tree to obtain the completion result when the user voice request obtained in real time satisfies the completion condition includes:
determining the voice type of the user voice request according to the user voice request acquired in real time under the condition that the user voice request acquired in real time meets the completion condition;
and selecting the corresponding prefix tree according to the voice type to complete the user voice request acquired in real time to obtain a completion result.
11. The voice interaction method according to claim 9, wherein the voice interaction method comprises:
and after the complete user voice request is received and a user voice instruction is obtained, adding the complete user voice request to the prefix tree.
12. The voice interaction method according to claim 9, wherein the voice interaction method comprises:
acquiring a historical user voice request in a preset time period;
and constructing the prefix tree according to the historical user voice request.
13. The voice interaction method according to claim 12, wherein the voice interaction method comprises:
setting a forgetting duration for the user voice request in the prefix tree;
and deleting the corresponding user voice request in the prefix tree under the condition that the unused time length of the user voice request in the prefix tree reaches the forgetting time length.
14. The voice interaction method according to claim 12, wherein the voice interaction method comprises:
counting the use frequency, the request length and/or the request proportion of the voice requests of the historical users;
and determining the weight corresponding to the user voice request in the prefix tree according to the use frequency, the request length and/or the request proportion.
15. A model training method for training the recurrent neural network model used in the voice interaction method of any one of claims 1 to 14, comprising:
acquiring training data;
constructing the recurrent neural network model;
and based on a Bayesian method, training the recurrent neural network by using the training data to obtain the trained recurrent neural network model.
16. The model training method of claim 15, wherein the acquiring training data comprises:
acquiring a plurality of historical user voice requests;
splicing the plurality of historical user voice requests through preset characters to obtain a request character string;
and segmenting the request character string according to the maximum processing length to obtain the training data.
17. The model training method of claim 16, wherein the segmenting the request string according to the maximum processing length to obtain the training data comprises:
traversing the request character string, and intercepting the request character string from the current character by the maximum processing length to obtain training input data;
intercepting the request character string from the next character of the current character by the maximum processing length to obtain training result data.
18. An electronic device, characterized in that the electronic device comprises a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the voice interaction method of any of claims 1-14.
19. A non-transitory computer-readable storage medium embodying a computer program, wherein the computer program, when executed by one or more processors, implements the voice interaction method of any of claims 1-14.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant