WO2021135534A1 - Dialogue management processing method, apparatus, device and medium based on speech recognition - Google Patents

Dialogue management processing method, apparatus, device and medium based on speech recognition

Info

Publication number
WO2021135534A1
Authority
WO
WIPO (PCT)
Prior art keywords
preset
corpus
dialogue
voice
speech
Prior art date
Application number
PCT/CN2020/122422
Other languages
English (en)
French (fr)
Inventor
叶怡周
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021135534A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems
    • G10L 2015/225 - Feedback of the input speech

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a dialogue management processing method, apparatus, computer device, and computer-readable storage medium based on speech recognition.
  • Speech recognition is generally performed through a speech recognition model, for example an ASR (Automatic Speech Recognition) model.
  • The recognition rate of an ASR model is mainly determined by its acoustic model and its language model: after the acoustic model generates candidate word sequences, the language model selects the word sequence that best matches normal usage as the final speech-to-text result.
  • In the traditional technology, ASR model training mainly uses accumulated training sample corpora; after the trained model has been verified in the development and test environments to meet the requirements, it is put into the production environment. Because the ASR model is trained on a limited, pre-accumulated corpus while users' query utterances in the actual production environment vary endlessly, the answer speech cannot cover all query speech, and the ASR model cannot accurately recognize query speech that was not covered during training. Therefore, even when the training effect of the ASR model meets the requirements during training, answers in the production environment may still be inaccurate because of the low accuracy of speech recognition.
  • The ASR model then has to be retrained periodically on the corpora generated in the production environment before it can be updated. This makes the training efficiency of the ASR model low, so the production-environment ASR model cannot promptly improve the completion rate of dialogue answers by improving the accuracy of speech recognition, which reduces the self-service quality of various robots.
  • This application provides a dialogue management processing method, apparatus, computer device, and computer-readable storage medium based on speech recognition, which can solve the problem in the traditional technology that the low training efficiency of the ASR model reduces the self-service quality of various robots.
  • In a first aspect, the present application provides a dialogue management processing method based on speech recognition. The method includes: receiving a user voice through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user voice to obtain a recognition result, responds to the user voice according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; sending the dialogue speech corpus to a corpus labeling system through a first preset message middleware, so that the corpus labeling system labels the dialogue speech corpus with a preset speech corpus labeling tool to obtain a labeled speech corpus; sending the labeled speech corpus to a speech recognition model training system through a second preset message middleware, so that the speech recognition model training system uses the labeled speech corpus to train a second preset speech recognition model; judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the ratio of the number of dialogues completed based on speech recognition within a preset time period to the number of all dialogues within that period; and, if it does, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  • In a second aspect, the present application also provides a dialogue management processing apparatus based on speech recognition, including: a dialogue unit, configured to receive a user voice through a dialogue management system so that the dialogue management system calls a first preset speech recognition model to recognize the user voice to obtain a recognition result, responds to the user voice according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; a labeling unit, configured to send the dialogue speech corpus to a corpus labeling system through a first preset message middleware, so that the corpus labeling system labels the dialogue speech corpus with a preset speech corpus labeling tool to obtain a labeled speech corpus; a training unit, configured to send the labeled speech corpus to a speech recognition model training system through a second preset message middleware, so that the speech recognition model training system uses the labeled speech corpus to train a second preset speech recognition model; a judging unit, configured to judge whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the ratio of the number of dialogues completed based on speech recognition within a preset time period to the number of all dialogues within that period; and a replacement unit, configured to, if the trained second preset speech recognition model satisfies the condition, replace the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  • In a third aspect, the present application also provides a computer device, which includes a memory and a processor; a computer program is stored on the memory, and when the processor executes the computer program, the following steps are implemented: receiving a user voice through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user voice to obtain a recognition result, responds to the user voice according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; sending the dialogue speech corpus to a corpus labeling system through a first preset message middleware, so that the corpus labeling system labels the dialogue speech corpus with a preset speech corpus labeling tool to obtain a labeled speech corpus; sending the labeled speech corpus to a speech recognition model training system through a second preset message middleware, so that the speech recognition model training system uses the labeled speech corpus to train a second preset speech recognition model; judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the ratio of the number of dialogues completed based on speech recognition within a preset time period to the number of all dialogues within that period; and, if it does, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  • In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to execute the following steps: receiving a user voice through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user voice to obtain a recognition result, responds to the user voice according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; sending the dialogue speech corpus to a corpus labeling system through a first preset message middleware, so that the corpus labeling system labels the dialogue speech corpus with a preset speech corpus labeling tool to obtain a labeled speech corpus; sending the labeled speech corpus to a speech recognition model training system through a second preset message middleware, so that the speech recognition model training system uses the labeled speech corpus to train a second preset speech recognition model; judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the ratio of the number of dialogues completed based on speech recognition within a preset time period to the number of all dialogues within that period; and, if it does, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  • This application provides a dialogue management processing method, apparatus, computer device, and computer-readable storage medium based on speech recognition. Since the accuracy of speech recognition is directly related to the speech corpus used to train the language model, this application couples the dialogue management system, the corpus labeling system, and the speech recognition model training system, so that the real speech corpora generated by the dialogue management system can be sent to the corpus labeling system for timely annotation, and the model training system uses the annotated real speech corpora to train the second preset speech recognition model in real time. Compared with the traditional approach, in which the three systems are split apart and handled separately, the embodiments of this application can use the real speech corpora generated in each business scenario to train the language model in speech recognition in a timely manner, which improves the recognition accuracy when training the speech recognition model and the accuracy of speech recognition in dialogue management, thereby increasing the dialogue completion rate, and in particular the self-service completion rate of intelligent customer service robots.
  • FIG. 1 is a schematic flowchart of a dialogue management processing method based on speech recognition provided by an embodiment of this application;
  • FIG. 2 is a schematic diagram of a specific embodiment of the dialogue management processing method based on speech recognition provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of a sub-flow of a dialogue management processing method based on speech recognition provided by an embodiment of this application;
  • FIG. 4 is a schematic block diagram of a dialogue management processing apparatus based on speech recognition provided by an embodiment of this application; and
  • FIG. 5 is a schematic block diagram of a computer device provided by an embodiment of this application.
  • FIG. 1 is a schematic flowchart of a dialogue management processing method based on speech recognition provided by an embodiment of the present application. As shown in FIG. 1, the method includes the following steps S101-S105:
  • S101. Receive a user voice through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user voice to obtain a recognition result, respond to the user voice according to the recognition result to complete a dialogue, and form the dialogue into a dialogue speech corpus.
  • Specifically, the speech recognition model training system needs real contextual corpora to improve the accuracy of speech recognition model training, and the dialogue management system can provide real dialogues corresponding to the scenarios in which users handle business. The user voice is received through the dialogue management system, which calls the first preset speech recognition model to recognize it and obtain a recognition result; for example, the dialogue management system calls a first ASR (Automatic Speech Recognition) model and a first NLU (Natural Language Understanding) model, both being the models currently in use, to recognize the user voice and obtain the corresponding recognition result, and responds to the user voice according to the recognition result to complete the dialogue, thereby realizing the interaction between the user and the intelligent voice computer equipment. Finally, the dialogue is formed into a dialogue speech corpus.
  • Speech corpus labeling is the annotation performed on speech corpora in natural language processing to provide the speech recognition model with the corpora it requires for recognition. It includes labeling the dialogue speech corpus in an ASR labeling manner and in an NLU labeling manner: the ASR labeling manner can be realized with voice annotation tools such as Praat and Transcriber, and the NLU labeling manner can label the speech corpus with corpus labeling tools such as Brat, Prodigy, or YEDDA.
  • FIG. 2 is a schematic diagram of a specific embodiment of the dialogue management processing method based on speech recognition provided by an embodiment of the application. As shown in FIG. 2, the dialogue management system reports the text of the interactive human-machine dialogue, the AsrSessionID (ASR dialogue ID), the time point of the interaction, the interaction result (success, transfer to a human agent, etc.), and auxiliary information, such as the business handled and other non-sensitive information including the process name, telephone area code, and gender. A sketch of such a record is shown below.
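  • As an illustration of the reported record, the following is a minimal Python sketch of the fields listed above; the class and field names are assumptions for illustration, not part of the patent.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List

# Illustrative schema; the patent only lists the reported items,
# not a concrete message format.
@dataclass
class DialogueCorpusMessage:
    unique_id: str                # ID marking one complete dialogue
    asr_session_ids: List[str]    # one AsrSessionID per ASR call
    dialogue_text: List[str]      # transcribed human-machine turns
    interaction_time: datetime    # time point of the interaction
    interaction_result: str       # e.g. "success" or "transfer_to_human"
    # non-sensitive auxiliary info: business handled, process name,
    # telephone area code, gender, etc.
    auxiliary_info: Dict[str, str] = field(default_factory=dict)
```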
  • The systems exchange these records through MQ (Message Queue), also known as message middleware.
  • In the traditional technology, the ASR model and the dialogue management system do not form an integrated collaboration. The embodiment of this application decouples the ASR products and the dialogue management system through message middleware while making them work as an integrated whole; since the speech recognition model is trained directly on the real speech corpora generated by the dialogue management system, both the training efficiency of the speech recognition model and the accuracy of speech recognition can be improved. A minimal sketch of publishing a corpus record over MQ follows.
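  • The sketch below assumes a RabbitMQ broker accessed through the pika client; the queue name and JSON serialization are illustrative choices, since the patent specifies only that MQ middleware is used.

```python
import json

import pika  # RabbitMQ client; any MQ middleware could fill this role

def publish_dialogue_corpus(message: dict, queue: str = "dialogue_corpus") -> None:
    """Publish one dialogue corpus record to the labeling system's queue."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)  # survive broker restarts
    channel.basic_publish(
        exchange="",
        routing_key=queue,
        body=json.dumps(message, default=str),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )
    connection.close()
```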
  • S102. Send the dialogue speech corpus to the corpus labeling system through the first preset message middleware, so that the corpus labeling system labels the dialogue speech corpus with the preset speech corpus labeling tool to obtain an initial annotated speech corpus. Operations on the initial annotated speech corpus, including manual revision and confirmation, may then be received to obtain the final annotated speech corpus. That is, semi-automatic labeling can be used: the user's recording is first transcribed into text by the ASR engine as a draft label, and the annotators check whether the result meets the requirements; if it is accurate, no operation is required, and if there is an error, the annotator corrects it to the right text. Only after the annotator confirms the corpus is the speech recognition model trained on it, which ensures the accuracy of the annotation and improves the accuracy of speech recognition. A minimal sketch of this loop follows.
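  • In the sketch below, `asr_transcribe` and `annotator_review` are hypothetical stand-ins for the ASR engine and the human review step; they are not names from the patent.

```python
def semi_automatic_label(recordings, asr_transcribe, annotator_review):
    """Transcribe each recording with the ASR engine, then let an
    annotator confirm or correct the draft label.

    asr_transcribe(audio) -> str and annotator_review(audio, draft) -> str
    are stand-ins for the ASR engine and the human review step.
    """
    confirmed = []
    for audio in recordings:
        draft = asr_transcribe(audio)           # machine pre-labeling
        final = annotator_review(audio, draft)  # human revision/confirmation
        confirmed.append({"audio": audio, "text": final})
    return confirmed  # only confirmed corpora are used for training
```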
  • S103. Send the labeled speech corpus to the speech recognition model training system through the second preset message middleware, so that the speech recognition model training system uses the labeled speech corpus to train a second preset speech recognition model. The second preset speech recognition model and the first preset speech recognition model may be the same or different; they may be constructed from the same speech recognition model or from different ones. Specifically, the corpus labeling system pushes the labeled corpus to a file server, and the model training system then obtains the labeled speech corpus from the file server and uses it to train the speech recognition model, which includes training the ASR model and training the NLU model.
  • Specifically, the language model of the ASR is trained, and it adopts a neural network language model. The ASR model needs to refer to context, that is, to the machine's question, so that the semantics are expressed more accurately; for example, a speech-to-text result of "credit card emergency" is quickly corrected to "credit card activation" by the neural network language model, as in the rescoring sketch below.
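  • A minimal sketch of such context-aware rescoring; `lm_score` is a hypothetical stand-in for a neural network language model that scores a candidate transcript given the machine's question.

```python
def pick_best_transcript(candidates, lm_score):
    """Choose the candidate word sequence the language model rates most
    plausible. lm_score(text, context) -> float is a stand-in for a
    neural network language model conditioned on the machine's question."""
    context = "Would you like to activate your credit card?"  # machine's turn
    return max(candidates, key=lambda text: lm_score(text, context))

# With a context-aware LM, "credit card activation" should outscore the
# acoustically similar "credit card emergency":
# pick_best_transcript(["credit card emergency", "credit card activation"],
#                      lm_score)
```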
  • The language model is strongly domain-specific, so different businesses use different language models, and the speech corpora of this application come directly and in real time from the corresponding business. Using the real speech corpora generated in the real business context to train the speech recognition model greatly improves the model's recognition rate in that business scenario, keeps the trained model consistently matched to the real business scenario, and makes the training of the speech recognition model specific to the business scenario. The NLU model is also trained; training the NLU model requires combining context, so as to satisfy both the characteristics of the language model used in the speech recognition model and the business field. Training the language model in speech recognition according to the different business scenarios improves the accuracy and efficiency of speech recognition, and in particular raises the completion rate of autonomous services.
  • In this embodiment, the user voice is received through the dialogue management system, a first speech recognition model (for example, comprising the first ASR model and the first NLU model) recognizes the user voice to obtain the recognition result, a response is made according to the recognition result, and the dialogue speech corpus is formed from the interaction. The dialogue speech corpus is sent to the corpus labeling system, which annotates it to obtain the labeled speech corpus and sends it to the file server; the model training system obtains the labeled corpus from the file server and uses it to train a second speech recognition model (for example, comprising the second ASR model and the second NLU model). The dialogue management system and the corpus labeling system communicate through MQ, as do the corpus labeling system and the model training system. Because the dialogue speech corpus generated by the dialogue management system belongs to the scenario of handling a complete business transaction (such as applying to fix a credit card limit), the semantic context within the corpus is closely related; the corpus used to train the speech recognition model therefore contains the contextual relationships, and each business appeal is an item the customer can have processed. Since the corpus carries the connotative relationships of its internal context, the corpora and results generated during processing are fed back in time to the ASR system and the NLU model in the NLP pipeline, so that the model training system can promptly train the ASR model and the NLU model on real contextual corpora, improve their speech recognition accuracy in time, and adjust them in time to the business scenario in which the dialogue management system is applied. Therefore, when the embodiments of the present application are applied to self-service, the success rate of customers handling business can be improved and the waste of business-handling resources avoided.
  • In the traditional technology, because the dialogue management system, the corpus labeling system, and the model training system are all separated, manually exporting and labeling the data is inefficient and its timeliness lags. In this embodiment, the three systems are decoupled through MQ, so that while their operation does not interfere with one another, the dialogue management system, the corpus labeling system, and the model training system can still be integrated. The business model corresponding to each business scenario has its corresponding real speech corpus; if a business scenario is new, corpora of the real scenario exist only after real customers handle the business through dialogue management operating in the production environment. Training the speech recognition model on those corpora ensures the reliability of the model relative to the business scenario and avoids the situation in the traditional technology where the corpora of such new scenarios are absent when the speech recognition model is trained and no such closed feedback loop is formed; in the traditional technology, robots of the question-and-answer category can only do one question and one answer.
  • S104. Determine whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the ratio of the number of dialogues completed based on speech recognition within a preset time period to the number of all dialogues within that period.
  • S105. If the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replace the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues; if it does not, continue to train the second preset speech recognition model with the new dialogue speech corpora generated in step S101 until the trained model satisfies the condition.
  • Specifically, the dialogue completion rate is the ratio of the number of dialogues completed based on speech recognition within the preset time period to the number of all dialogues within that period, and the preset dialogue completion rate condition concerns the performance of the second preset speech recognition model over that period. The second preset speech recognition model is trained for a preset period, such as one month or six months, using the speech corpora generated by the real business-handling scenarios of the dialogue management system, after which it is judged whether the trained model satisfies the condition. If it does, the trained second preset speech recognition model replaces the first preset speech recognition model, and when the dialogue management system receives a voice request to handle business, it calls the second preset speech recognition model to recognize the user's voice; that is, the dialogue management system calls the trained second preset speech recognition model to complete new dialogues. Because the trained second preset speech recognition model uses the speech corpora generated in real time by real business-handling scenarios, it better matches the actual needs of real business handling and is better adapted to the real scenario, which improves the accuracy of speech recognition during business handling, improves the quality of the dialogue, and thus increases the dialogue completion rate. A sketch of the replacement decision follows.
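  • A minimal sketch of this replacement decision; the function names and the `(completed, total)` statistics format are assumptions for illustration.

```python
def completion_rate(completed_dialogues: int, total_dialogues: int) -> float:
    """Dialogue completion rate: dialogues completed based on speech
    recognition in a preset time period over all dialogues in that period."""
    return completed_dialogues / total_dialogues if total_dialogues else 0.0

def maybe_replace_model(current_model, trained_model,
                        current_stats, trained_stats):
    """Swap in the trained second model only if its completion rate beats
    the first model's over the same preset period; each stats argument is
    an illustrative (completed, total) tuple."""
    first_rate = completion_rate(*current_stats)
    second_rate = completion_rate(*trained_stats)
    return trained_model if second_rate > first_rate else current_model
```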
  • In summary, the embodiment of the application receives the user voice through the dialogue management system, uses the first preset speech recognition model to recognize it and obtain the recognition result, responds to the user voice according to the recognition result to form the interactive dialogue speech corpus, and sends the corpus to the corpus labeling system through the message middleware; the corpus labeling system receives and annotates it to obtain the labeled speech corpus; the model training system obtains the labeled corpus and uses it to train the second preset speech recognition model; it is then judged whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, and if so, the trained second preset speech recognition model replaces the first preset speech recognition model for the dialogue management system to call to complete new dialogues. Recognition accuracy is directly related to the speech corpus used to train the language model; because this embodiment couples the three systems, the real speech corpora generated by the dialogue management system can be sent to the corpus labeling system for timely annotation, and the model training system uses the annotated real corpora to train the second preset speech recognition model in real time. Compared with the traditional technology, in which the dialogue management system, the corpus labeling system, and the speech recognition model training system are split apart and handled separately, this embodiment can use the real speech corpora generated in each business scenario to train the language model in speech recognition in a timely manner, which improves the recognition accuracy when training the speech recognition model and the accuracy of speech recognition in dialogue management, thereby increasing the dialogue completion rate, in particular the self-service completion rate of the intelligent customer service robot.
  • In one embodiment, the dialogue speech corpus includes several speech corpora, one formed for each of several dialogues, and the speech corpus formed for each dialogue includes the interaction result of that dialogue, where the interaction result includes transfer to a human agent. The step of labeling the dialogue speech corpus with a preset speech corpus labeling tool to obtain the labeled speech corpus then includes: identifying the speech corpora whose interaction result is transfer to a human agent; eliminating those corpora from the dialogue speech corpus to obtain a screened dialogue speech corpus; and labeling the screened dialogue speech corpus with the preset speech corpus labeling tool to obtain the labeled speech corpus.
  • That is, only the speech corpora whose interaction result is a successful interaction are used as corpora for training the speech recognition model; this further improves the training efficiency and training accuracy of the model. If the interaction result is transfer to a human agent, follow-up business personnel need to check the reason, and the speech recognition model is retrained accordingly. Whether the interaction result is transfer to a human agent can be judged by a value assigned to a field: if the interaction was transferred to a human agent, the field "R" corresponding to the interaction result is assigned the value "0"; if it was not transferred, that is, the self-service interaction with the customer succeeded, the field "R" is assigned the value "1", and so on, as in the screening sketch below.
  • Please refer to FIG. 3, which is a schematic diagram of a sub-flow of the dialogue management processing method based on speech recognition provided by an embodiment of the application. In this embodiment, the step in which the dialogue management system calls the first preset speech recognition model to recognize the user voice to obtain a recognition result, responds to the user voice according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus includes the following sub-steps:
  • S301. Receive the first voice corresponding to the user voice, and generate the preset dialogue coding identifier of the dialogue corresponding to the user voice. The preset dialogue coding identifier may be a dialogue serial number generated, in a first preset order, from character strings describing the machine elements, time elements, and user elements involved in the dialogue, including the date and time of the dialogue, the dialogue sequence number, and the number of the connected self-service machine.
  • S302. According to the preset dialogue coding identifier, call the first preset ASR model to convert the first voice into user text, and generate the ASR dialogue coding identifier corresponding to the call. The ASR dialogue coding identifier is the serial number of the call to the ASR model; a character string including the preset dialogue coding identifier, the date and time of the call, and the number of the call can be generated in a second preset order to form the ASR dialogue coding identifier, as in the sketch below.
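  • A minimal sketch of building the two identifiers; the exact concatenation order and separators are illustrative, since the patent only names the elements to be combined.

```python
from datetime import datetime

def make_dialogue_id(dialogue_seq: int, machine_no: str) -> str:
    """Preset dialogue coding identifier: date/time of the dialogue,
    dialogue sequence number, and connected self-service machine number,
    concatenated in a first preset order (the order here is illustrative)."""
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    return f"{stamp}-{dialogue_seq:06d}-{machine_no}"

def make_asr_session_id(dialogue_id: str, call_no: int) -> str:
    """ASR dialogue coding identifier (AsrSessionID): the dialogue ID,
    the date/time of the ASR call, and the call number, concatenated in
    a second preset order."""
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    return f"{dialogue_id}-{stamp}-{call_no:03d}"
```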
  • S303. Call the first preset NLU model to understand the user text to obtain the user semantics.
  • S304. According to the user semantics, screen the preset answer corresponding to the user semantics from the preset database through a preset semantic matching method, where semantic matching includes semantic exact matching and semantic fuzzy matching: semantic exact matching means the preset answer in the database carries semantics exactly the same as the semantics recognized from the user's voice, while semantic fuzzy matching means the preset answer carries semantics the same as or similar to the recognized semantics. A sketch follows.
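  • A minimal sketch of the two matching modes, representing semantics as sets of labels; the set representation and the Jaccard similarity threshold are assumptions for illustration.

```python
def exact_match(user_semantics: set, answer_semantics: set) -> bool:
    """Semantic exact matching: the preset answer carries semantics
    identical to those recognized from the user's voice."""
    return user_semantics == answer_semantics

def fuzzy_match(user_semantics: set, answer_semantics: set,
                threshold: float = 0.5) -> bool:
    """Semantic fuzzy matching: the same or similar semantics; similarity
    is measured here by Jaccard overlap, an illustrative choice."""
    union = user_semantics | answer_semantics
    overlap = len(user_semantics & answer_semantics) / len(union) if union else 0.0
    return exact_match(user_semantics, answer_semantics) or overlap >= threshold

def select_answer(user_semantics, preset_database):
    """Screen the preset database for an answer matching the user semantics."""
    for answer_semantics, answer_text in preset_database:
        if fuzzy_match(user_semantics, answer_semantics):
            return answer_text
    return None
```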
  • S305. Convert the preset answer into a response voice, and send the response voice to the user to respond to the first voice.
  • S306. Determine whether the user voice has ended.
  • S307. If the user voice has not ended, receive the second voice corresponding to the user voice and iteratively execute the step of calling the first preset ASR model according to the preset dialogue coding identifier until the user voice ends, so as to complete the dialogue, then proceed to step S308.
  • S308. If the user voice has ended, complete the dialogue.
  • S309. Form the user voice and the preset answers into a dialogue speech corpus, where the dialogue speech corpus includes the preset dialogue coding identifier and the ASR dialogue coding identifiers.
  • Abbreviations: ASR, Automatic Speech Recognition; HMM, hidden Markov model; DNN, deep neural network.
  • Specifically, when handling business through a self-service voice service, the user needs to interact with the service continuously, for example in a question-and-answer form. When the dialogue management system begins to accept the user voice, the first voice of the user generates the preset dialogue coding identifier of the call, which is used to track the business-handling call. According to the preset dialogue coding identifier, the first preset ASR model is called to convert the first voice into the first user text, and the first ASR dialogue coding identifier corresponding to the first voice is generated to describe the ASR call made for that voice. The first preset NLU model is then called to understand the first user text and obtain the first user semantics; according to the first user semantics, a first preset answer is screened from the preset database through the preset semantic matching method, converted into a first response voice, and used to respond to the first voice. It is then determined whether the user voice has ended. If not, the second voice of the user is received, the first preset ASR model is called again to convert the second voice into the second user text, and the second ASR dialogue coding identifier corresponding to the second voice is generated to describe that ASR call; the first preset NLU model is called again to understand the second user text and obtain the second user semantics, the second preset answer corresponding to the second user semantics is screened from the preset database through the preset semantic matching method and converted into a second response voice to respond to the second voice, and it is again determined whether the user voice has ended; if not, the system continues to receive the user voice in the same way.
  • In other words, the dialogue management system can generate a UniqueID, an ID used to mark one dialogue. The dialogue management system records every utterance of the user and every answer of the dialogue management system, and each time the ASR model is called to convert the user's words into text, an AsrSessionID is generated to mark that ASR interaction. The ASR system sends the speech-to-text result to the dialogue management system, which calls the NLU model to understand the text and, according to the result of the understanding, selects the corresponding preset answer from the database to respond, thereby realizing the interaction between the user and the self-service voice computer equipment; the dialogue in the interactive process is formed into a dialogue speech corpus. Throughout this process the ASR model and the NLU model are therefore essential. Setting a separate ID for the dialogue and for each ASR call allows the context of a dialogue to be associated through the dialogue ID into a complete interaction process, which makes it easier for the subsequent ASR and NLU models to learn from the context, thereby improving the accuracy of speech recognition model training. A sketch of the whole loop follows.
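  • A minimal sketch of the loop with one UniqueID per dialogue and one AsrSessionID per ASR call; all the functions passed in (`receive_voice`, `asr`, `nlu`, `select_answer`, `speak`) are hypothetical stand-ins for the systems described above, and the ID format is simplified.

```python
import itertools
import uuid

def run_dialogue(receive_voice, asr, nlu, select_answer, speak):
    """Multi-turn loop corresponding to S301-S309: receive_voice() returns
    None once the user voice has ended; each other callable stands in for
    the ASR model, NLU model, answer database, and speech synthesis."""
    unique_id = uuid.uuid4().hex            # marks this whole dialogue
    corpus = {"UniqueID": unique_id, "turns": []}
    for call_no in itertools.count(1):
        voice = receive_voice()
        if voice is None:                   # user voice has ended
            break
        asr_session_id = f"{unique_id}-{call_no:03d}"  # marks one ASR call
        text = asr(voice)                   # speech-to-text
        semantics = nlu(text)               # understand the user text
        answer = select_answer(semantics)   # screen the preset database
        speak(answer)                       # respond with a response voice
        corpus["turns"].append(
            {"AsrSessionID": asr_session_id, "user": text, "answer": answer}
        )
    return corpus                           # the dialogue speech corpus
```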
  • In one embodiment, the step in which the corpus labeling system annotates the dialogue speech corpus with a preset speech corpus labeling tool to obtain the labeled speech corpus includes: labeling the dialogue speech corpus with a preset ASR labeling method to obtain an ASR-labeled speech corpus, and labeling the dialogue speech corpus with a preset NLU labeling method to obtain an NLU-labeled speech corpus. The step in which the speech recognition model training system uses the labeled speech corpus to train the second preset speech recognition model then includes: the speech recognition model training system obtains the ASR-labeled speech corpus and the NLU-labeled speech corpus, uses the ASR-labeled speech corpus to train the second preset ASR model, and uses the NLU-labeled speech corpus to train the second preset NLU model, as sketched below.
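  • A minimal sketch of the training system's dispatch of the two labeled corpora; `train_asr` and `train_nlu` are hypothetical stand-ins for the actual training routines.

```python
def train_second_models(asr_labeled_corpus, nlu_labeled_corpus,
                        train_asr, train_nlu):
    """Consume the two labeled corpora and train the second preset ASR
    model and second preset NLU model."""
    second_asr_model = train_asr(asr_labeled_corpus)  # acoustic/transcription side
    second_nlu_model = train_nlu(nlu_labeled_corpus)  # language-understanding side
    return second_asr_model, second_nlu_model
```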
  • The ASR model is an acoustic model that integrates knowledge of acoustics and pronunciation so as to convert sound into text. When the computer device performs speech recognition, because speech is a continuous sound, the device does not know which part of the speech corresponds to which phoneme or word; the dialogue speech corpus must first be annotated by the ASR labeling method so that the speech can be segmented into phonemes or words, which are then converted into the corresponding text, realizing the conversion of speech into text through the ASR model. The ASR labeling method is the voice annotation method for the dialogue speech corpus and can be realized through voice annotation tools, including the Praat annotation method corresponding to the Praat tool and the Transcriber annotation method corresponding to the Transcriber tool.
  • The NLU model is a language model that learns the relationships between words from the training corpus in order to estimate the likelihood of a hypothesized word sequence, also called the language model score, which reflects how words compose phrases and how phrases compose sentences; the language model can usually achieve a fairly accurate estimate of the language. Language model toolkits include SRILM, IRSTLM, MITLM, and BerkeleyLM. To convert text into words and sentences that express meaning through the NLU model, the resulting text needs to be annotated, so that the annotated text can be used by the NLU model to form meaningful words and sentences: the obtained text is annotated by the preset NLU labeling method to obtain the NLU-labeled speech corpus, which the NLU model then converts into words and sentences with meaningful content, finally realizing the conversion of speech into ordinary written language. Corpus tagging tools can be used to label the speech corpus; they include the Brat corpus tagging method corresponding to the Brat corpus tagging tool, the Parker corpus tagging method corresponding to the Parker corpus tagging tool, the YEDDA corpus tagging method corresponding to the YEDDA corpus tagging tool, the Snorkel corpus tagging method corresponding to the Snorkel corpus tagging tool, and the Prodigy corpus tagging method corresponding to the Prodigy corpus tagging tool.
  • In this embodiment, the user voice is received through the dialogue management system, the first ASR model and the first NLU model recognize the user voice to obtain a recognition result, a response is made according to the recognition result to form the dialogue speech corpus, and the corpus is sent to the corpus labeling system. The corpus labeling system annotates the dialogue speech corpus with the preset ASR labeling method to obtain the ASR-labeled speech corpus and with the preset NLU labeling method to obtain the NLU-labeled speech corpus, and sends them to the file server; the model training system obtains the labeled corpora from the file server and uses the ASR-labeled corpus and the NLU-labeled corpus to train the second ASR model and the second NLU model, so that the real corpora are used to train the ASR. Since speech recognition accuracy depends greatly on the speech corpus used for the language model, training the ASR model of each business on the real speech corpora generated by that business's own scenario improves the accuracy of the ASR model's speech recognition; the increase in ASR accuracy in turn promotes the accuracy of the NLU model's understanding and ultimately improves the accuracy of the entire speech recognition, finally raising the self-service completion rate of intelligent customer service robots.
  • In one embodiment, the dialogue management system calls the second preset ASR model and the second preset NLU model to recognize newly received user voices and respond to them to complete dialogues. Within a preset time period, a first completion rate is counted for the dialogues completed by the first preset ASR model and the first preset NLU model recognizing user voices, and a second completion rate is counted for the dialogues completed by the second preset ASR model and the second preset NLU model recognizing the new user voices during the same period. The step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition then includes: judging whether the second completion rate is greater than the first completion rate, and, if it is, determining that the trained second preset speech recognition model satisfies the preset dialogue completion rate condition; in that case the second preset ASR model replaces the first preset ASR model and the second preset NLU model replaces the first preset NLU model.
  • Specifically, the completion rate of each self-service voice service is counted, that is, the dialogues whose interaction result is a success requiring no human agent. If, after the trained second preset ASR model and second preset NLU model are adopted, the completion rate of the self-service voice service improves, the second preset ASR model and the second preset NLU model are used; if the completion rate does not improve, the old models corresponding to the first preset ASR model and the first preset NLU model continue to be used, and the second preset ASR model and the second preset NLU model continue to be trained. For example, the dialogue management system counts the completion rate of each self-service voice service every month; if the trained second preset speech recognition model improves the completion rate, the second preset speech recognition model is used, and otherwise the first preset speech recognition model continues to complete the user's business dialogues while the second preset speech recognition model continues to be trained. Because this embodiment trains a different ASR neural network language model for each business with the real speech recognition corpora of that business, the accuracy of speech recognition improves, which also promotes the correctness of the NLU model's understanding. Building business-specific corpus labeling and model training systems in this way is extremely important, especially for customer service robots: it improves the accuracy and efficiency of speech recognition so that speech can be recognized accurately across various professional services, ultimately raising the self-service completion rate of intelligent customer service robots.
  • FIG. 4 is a schematic block diagram of a dialog management processing apparatus based on voice recognition provided by an embodiment of the present application.
  • Correspondingly, an embodiment of the present application also provides a dialogue management processing device based on speech recognition. The device includes units for executing the above-described dialogue management processing method based on speech recognition and may be configured in a computer device. As shown in FIG. 4, the dialogue management processing device 400 based on speech recognition includes a dialogue unit 401, a labeling unit 402, a training unit 403, a judging unit 404, and a replacement unit 405.
  • The dialogue unit 401 is configured to receive the user voice through the dialogue management system, so that the dialogue management system calls the first preset speech recognition model to recognize the user voice to obtain a recognition result, responds to the user voice according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus.
  • The labeling unit 402 is configured to send the dialogue speech corpus to the corpus labeling system through the first preset message middleware, so that the corpus labeling system labels the dialogue speech corpus with a preset speech corpus labeling tool to obtain a labeled speech corpus.
  • The training unit 403 is configured to send the labeled speech corpus to the speech recognition model training system through the second preset message middleware, so that the speech recognition model training system uses the labeled speech corpus to train the second preset speech recognition model.
  • The judging unit 404 is configured to judge whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, where the dialogue completion rate is the ratio of the number of dialogues completed based on speech recognition within a preset time period to the number of all dialogues within that period.
  • The replacement unit 405 is configured to, if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replace the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  • In one embodiment, the dialogue speech corpus includes several speech corpora, one formed for each of several dialogues, and the speech corpus formed for each dialogue includes the interaction result of that dialogue, where the interaction result includes transfer to a human agent. The labeling unit 402 then includes: a recognition subunit, configured to identify the speech corpora whose interaction result is transfer to a human agent; a culling subunit, configured to eliminate from the dialogue speech corpus the speech corpora whose interaction result is transfer to a human agent, so as to obtain a screened dialogue speech corpus; and a labeling subunit, configured to label the screened dialogue speech corpus with a preset speech corpus labeling tool to obtain the labeled speech corpus.
  • In one embodiment, the dialogue unit 401 includes: a first receiving subunit, configured to receive the first voice corresponding to the user voice and generate the preset dialogue coding identifier of the dialogue corresponding to the user voice; a first calling subunit, configured to call the first preset ASR model according to the preset dialogue coding identifier so as to convert the first voice into user text through the first preset ASR model, and to generate, based on the preset dialogue coding identifier, the ASR dialogue coding identifier corresponding to the call; a second calling subunit, configured to call the first preset NLU model to understand the user text and obtain the user semantics; a screening subunit, configured to screen, according to the user semantics, the preset answer corresponding to the user semantics from the preset database through a preset semantic matching method; a response subunit, configured to convert the preset answer into a response voice and send the response voice to the user to respond to the first voice; a first judging subunit, configured to judge whether the user voice has ended and, if it has not, to receive the second voice corresponding to the user voice and iteratively execute the step of calling the first preset ASR model according to the preset dialogue coding identifier until the user voice ends, so as to complete the dialogue; and a dialogue forming subunit, configured to, if the user voice has ended, complete the dialogue and form the user voice and the preset answers into a dialogue speech corpus, where the dialogue speech corpus includes the preset dialogue coding identifier and the ASR dialogue coding identifiers.
  • In one embodiment, the labeling unit 402 includes: a first labeling subunit, configured to label the dialogue speech corpus with a preset ASR labeling method to obtain an ASR-labeled speech corpus; and a second labeling subunit, configured to label the dialogue speech corpus with a preset NLU labeling method to obtain an NLU-labeled speech corpus. The training unit 403 includes: an acquisition subunit, used by the speech recognition model training system to acquire the ASR-labeled speech corpus and the NLU-labeled speech corpus; and a training subunit, configured to train the second preset ASR model with the ASR-labeled speech corpus and to train the second preset NLU model with the NLU-labeled speech corpus.
  • In one embodiment, the dialogue management processing device 400 based on speech recognition further includes: a calling unit, used by the dialogue management system to call the second preset ASR model and the second preset NLU model to recognize the received voice of a new user and respond to it to complete the dialogue; a first statistics unit, configured to count the first completion rate of the dialogues completed by the first preset ASR model and the first preset NLU model recognizing user voices within a preset time period; and a second statistics unit, configured to count the second completion rate of the dialogues completed by the second preset ASR model and the second preset NLU model recognizing the new users' voices within the preset time period. The judging unit 404 includes: a second judging subunit, configured to judge whether the second completion rate is greater than the first completion rate; and a determining subunit, configured to determine, if the second completion rate is greater than the first completion rate, that the trained second preset speech recognition model satisfies the preset dialogue completion rate condition.
  • It should be noted that the division and connection of the units in the above dialogue management processing device based on speech recognition are only for illustration. In other embodiments, the device may be divided into different units as needed, and the units may adopt different connection orders and manners to complete all or part of the functions of the device.
  • the foregoing apparatus for dialog management and processing based on voice recognition may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 5.
  • FIG. 5 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 may be a computer device such as a desktop computer or a server, or may be a component or component in other devices.
  • Referring to FIG. 5, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504; the memory may also be a volatile storage medium.
  • The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The internal memory 504 provides an environment for the running of the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the above-described dialogue management processing method based on speech recognition. The processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500.
  • the network interface 505 is used for network communication with other devices.
  • Those skilled in the art can understand that the computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 5 and are not repeated here.
  • the processor 502 is configured to run a computer program 5032 stored in a memory, so as to implement the dialog management processing method based on voice recognition described in the embodiment of the present application.
  • It should be understood that the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be a non-volatile or a volatile computer-readable storage medium; it stores a computer program, and when the computer program is executed by a processor, the processor executes the steps of the dialogue management processing method based on speech recognition described in the above embodiments.
  • The storage medium is a physical, non-transitory storage medium capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A dialogue management processing method based on speech recognition, an apparatus (400), a computer device (500), and a computer-readable storage medium, relating to the field of artificial intelligence technology. A user voice is received through a dialogue management system; a first preset speech recognition model recognizes the user voice; the user voice is responded to according to the recognition result to form a dialogue speech corpus; the dialogue speech corpus is sent to a corpus labeling system, which annotates it to obtain a labeled speech corpus; a model training system obtains the labeled speech corpus and uses it to train a second preset speech recognition model; it is judged whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, and if it does, the trained second preset speech recognition model is adopted for the dialogue management system to call to complete new dialogues, which can increase the dialogue completion rate.

Description

Dialogue management processing method, apparatus, device and medium based on speech recognition
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on June 16, 2020, with application number 202010550379.9 and entitled "Dialogue management processing method, apparatus, device and medium based on speech recognition", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a dialogue management processing method, apparatus, computer device, and computer-readable storage medium based on speech recognition.
Background
With the development of speech recognition technology, and especially its application to robots such as self-service systems, higher requirements are placed on the recognition performance of speech recognition; in particular, customer service robots of all kinds must recognize users' speech accurately in the professional services corresponding to their various application scenarios.
Speech recognition is generally performed through a speech recognition model, for example an ASR (Automatic Speech Recognition) model. The recognition rate of an ASR model is mainly determined by its acoustic model and its language model: after the acoustic model produces candidate word sequences, the language model selects the word sequence that best matches normal usage as the final speech-to-text result. In the traditional technology, ASR model training mainly uses accumulated training sample corpora, and the model is put into the production environment after its recognition training has been verified in the development and test environments to meet the requirements. Because the traditional ASR technology uses an ASR model trained on a limited, pre-accumulated corpus while users' query utterances in the actual production environment vary endlessly, the answer speech cannot cover all query speech, and the ASR model cannot accurately recognize the query speech not covered during training. Therefore, in the traditional ASR training system, even if the training effect of the ASR model meets the requirements during training, inaccurate answers still occur in the production environment because of the low accuracy of speech recognition, and the ASR model must be retrained repeatedly and periodically on corpora generated in the production environment before it can be updated.
The inventor realized that because the ASR model in the traditional technology cannot be updated in time, the ASR training cycle is invisibly lengthened and the training efficiency of the ASR model is reduced; this low training efficiency means that, in dialogue management, the production-environment ASR model cannot promptly raise the completion rate of dialogue answers by improving the accuracy of speech recognition, which reduces the self-service quality of various robots.
Summary of the Invention
This application provides a dialogue management processing method and apparatus, a computer device, and a computer-readable storage medium based on speech recognition, which can solve the problem in the traditional approach that low ASR training efficiency degrades the self-service quality of all kinds of robots.
In a first aspect, this application provides a dialogue management processing method based on speech recognition, the method comprising: receiving user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; sending the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus; sending the annotated speech corpus to a speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model; judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within that time period; and, if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
In a second aspect, this application further provides a dialogue management processing apparatus based on speech recognition, comprising: a dialogue unit configured to receive user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; an annotation unit configured to send the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus; a training unit configured to send the annotated speech corpus to a speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model; a judgment unit configured to judge whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within that time period; and a replacement unit configured to, if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replace the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
In a third aspect, this application further provides a computer device comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program: receiving user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; sending the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus; sending the annotated speech corpus to a speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model; judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within that time period; and, if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps: receiving user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; sending the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus; sending the annotated speech corpus to a speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model; judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within that time period; and, if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
This application provides a dialogue management processing method and apparatus, a computer device, and a computer-readable storage medium based on speech recognition. Since speech recognition accuracy depends directly on the speech corpus used to train the language model, this application couples the dialogue management system, the corpus annotation system, and the speech recognition model training system, so that the real speech corpus produced by the dialogue management system is sent to the corpus annotation system for annotation in time, and the model training system uses the annotated real speech corpus to train the second preset speech recognition model in real time. Compared with the traditional approach, in which the dialogue management system, the corpus annotation system, and the model training system are separated and handled independently, the embodiments of this application train the language model used in speech recognition promptly on the real speech corpus produced in each business scenario, improving recognition accuracy when training the speech recognition model and hence the accuracy of speech recognition in dialogue management, which raises the dialogue completion rate and, in particular, the self-service completion rate of intelligent customer-service robots.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the dialogue management processing method based on speech recognition provided by an embodiment of this application;
Fig. 2 is a schematic diagram of a specific embodiment of the dialogue management processing method based on speech recognition provided by an embodiment of this application;
Fig. 3 is a schematic sub-flowchart of the dialogue management processing method based on speech recognition provided by an embodiment of this application;
Fig. 4 is a schematic block diagram of the dialogue management processing apparatus based on speech recognition provided by an embodiment of this application; and
Fig. 5 is a schematic block diagram of the computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of this application.
Please refer to Fig. 1, a schematic flowchart of the dialogue management processing method based on speech recognition provided by an embodiment of this application. As shown in Fig. 1, the method includes the following steps S101-S105:
S101. Receive user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus.
Specifically, a speech recognition model training system needs corpus from real contexts to improve training accuracy, and the dialogue management system can provide real dialogues from the scenarios in which users transact business. The dialogue management system receives the user speech and calls the first preset speech recognition model to recognize it; for example, it calls a first ASR (Automatic Speech Recognition) model and a first NLU (Natural Language Understanding) model (both the models currently in use) to obtain the recognition result corresponding to the user speech, and responds to the user speech according to the recognition result to complete the dialogue, thereby realizing the interaction between the user and the intelligent speech computer device; finally, the dialogue is formed into a dialogue speech corpus.
S102. Send the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus.
Here, speech corpus annotation is the annotation performed on a speech corpus in natural language processing so as to provide the speech recognition model with the corpus it needs for recognition. It includes annotating the dialogue speech corpus in an ASR annotation mode and in an NLU annotation mode; the ASR annotation mode can be implemented with speech annotation tools such as Praat or Transcriber, and the NLU annotation mode can be implemented with corpus annotation tools such as Brat, Prodigy, or YEDDA.
Specifically, after the dialogue management system obtains the dialogue speech corpus by interacting with the user, it sends the corpus to the corpus annotation system through the first preset message middleware; the corpus annotation system receives the dialogue speech corpus, annotates it with the preset speech corpus annotation tool to obtain the annotated speech corpus, and pushes the result to a file server. For example, please refer to Fig. 2, a schematic diagram of a specific embodiment of the method. As shown in Fig. 2, in this embodiment, after each human-machine interaction the dialogue management system sends the text of the human-machine exchange, the AsrSessionID (the ASR dialogue ID), the time of the interaction, the interaction result (success, transfer to a human agent, and so on), and auxiliary information (non-sensitive information such as the name of the business flow handled, the telephone area code, and gender) to the corpus annotation system through MQ (Message Queue, also called message middleware); the corpus annotation system annotates according to the dialogue, currently annotating the corpus for the ASR language model and the corpus for the NLU model separately. Whereas much traditional corpus annotation separates the ASR product from the dialogue management system, so that the ASR model and the dialogue management system never cooperate as a whole, the embodiments of this application decouple the ASR product and the dialogue management system through MQ while forming an integrated cooperation; because the speech recognition model is trained directly on the real speech corpus produced by the dialogue management system, both training efficiency and recognition accuracy can be improved.
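By way of illustration only, the following is a minimal sketch of the hand-off just described, assuming a RabbitMQ broker as the message middleware and a hypothetical queue name "corpus_annotation"; the field names mirror those mentioned above (AsrSessionID, interaction result, auxiliary information), but the exact message layout is an assumption, not part of the disclosed method.
```python
# Illustrative sketch: publish one human-machine turn to the annotation system
# over a message queue. Assumes a RabbitMQ broker on localhost and a
# hypothetical queue name; the message layout is an assumption.
import json
import time

import pika

def publish_dialogue_corpus(unique_id, asr_session_id, text, result_ok, extras):
    message = {
        "UniqueID": unique_id,            # marks one whole call
        "AsrSessionID": asr_session_id,   # marks one ASR interaction
        "text": text,                     # transcribed human-machine exchange
        "time": time.time(),              # time of the interaction
        "result": "success" if result_ok else "transfer_to_agent",
        "extras": extras,                 # e.g. business flow name, area code
    }
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="corpus_annotation", durable=True)
    channel.basic_publish(exchange="",
                          routing_key="corpus_annotation",
                          body=json.dumps(message, ensure_ascii=False))
    connection.close()
```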
Further, after the dialogue speech corpus is sent to the corpus annotation system through the first preset message middleware and annotated with the preset speech corpus annotation tool to obtain an initially annotated speech corpus, the initially annotated corpus can additionally be operated on manually; operations performed on the initially annotated speech corpus, including revision and confirmation, are received to obtain the annotated speech corpus. That is, annotation can be semi-automatic: the user's recording is first transcribed into text by the ASR engine and annotated, and annotators then check whether the results meet the requirements; if a result is accurate no action is needed, and if it is wrong the annotator corrects it to the right text. Only after the annotator confirms the corpus is it used to train the speech recognition model, which guarantees annotation accuracy and improves recognition accuracy.
S103. Send the annotated speech corpus to the speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model.
Here, the second preset speech recognition model and the first preset speech recognition model may be the same or different, and may be built on the same speech recognition model or on different ones.
Specifically, the corpus annotation system pushes the annotated corpus to the file server, and the model training system fetches the annotated speech corpus from there and uses it to train the speech recognition models, including training the ASR model and training the NLU model. For training the ASR language model, a neural network language model is used; the ASR model must take context into account, that is, the machine's questions, which lets it express semantics more accurately. For example, a speech-to-text result of "信用卡急活" is quickly corrected to "信用卡激活" ("credit card activation") by the neural network language model. Language models are also strongly domain-specific, so different businesses use different language models, and the speech corpus in this application comes directly and in real time from the dialogue management system. Training the speech recognition model on the real corpus produced in a real business context greatly raises its recognition rate in that business scenario, binds the model tightly to the real scenario so that the trained model matches it consistently, and makes the training targeted at that scenario. The NLU model is trained at the same time, and NLU training depends even more on context; this fits the fact that the language model used in speech recognition is strongly tied to the business domain, allowing the language model to be trained per business scenario, which improves the accuracy and efficiency of speech recognition and, above all, the completion rate of self-service.
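As a toy illustration of the language-model selection just described: among acoustically plausible candidate word sequences, the candidate the language model scores as most natural is kept. The scores below are invented for illustration; in practice a neural network language model would supply them.
```python
# Toy illustration: keep the candidate word sequence with the best
# language-model score. The log-scores here are fabricated.
def pick_transcript(candidates, lm_score):
    return max(candidates, key=lm_score)

toy_scores = {"信用卡急活": -9.2, "信用卡激活": -3.1}  # invented scores
assert pick_transcript(toy_scores, toy_scores.get) == "信用卡激活"
```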
Continuing with Fig. 2: in this embodiment of the application, user speech is received through the dialogue management system and recognized with a first speech recognition model (for example, comprising a first ASR model and a first NLU model) to obtain a recognition result, and a response is made according to the recognition result, forming the dialogue speech corpus of the interaction; the corpus is sent to the corpus annotation system, which receives it, annotates it to obtain the annotated speech corpus, and sends it to the file server; the model training system fetches the annotated corpus from the file server and uses it to train a second speech recognition model (for example, comprising a second ASR model and a second NLU model). The dialogue management system and the corpus annotation system communicate via MQ, as do the corpus annotation system and the model training system. Because the dialogue speech corpus produced by the dialogue management system belongs to the scenario of completing one whole business transaction (such as applying for a fixed credit-card limit), its semantic context is closely interrelated, and the corpus used to train the speech recognition model carries those contextual relationships. The business goal is that every customer call is completed; since the corpus has the intrinsic meaning carried by its context, promptly feeding the corpus and results produced during transactions back to the ASR system and NLU model in NLP lets the model training system train the ASR and NLU models on real, context-bearing corpus in time, promptly raising their recognition accuracy and adjusting them to the business scenario in which the dialogue management system is deployed. Applied to self-service, the embodiments of this application therefore raise the success rate of customers' transactions and avoid wasting transaction-handling resources.
In the traditional approach, the dialogue management system, the corpus annotation system, and the model training system are all separate; data is exported and annotated manually, which is inefficient and lags behind. In this embodiment the three systems are decoupled through MQ, so that each runs without interfering with the others while together forming an integrated whole: the business model for a business scenario has its corresponding real speech corpus, and if a scenario is new, it is operated in production through dialogue management, so that corpus for the real scenario exists only after real customers have transacted, at which point it can serve as training corpus for the speech recognition model. This guarantees the model's reliability for the business scenario and avoids the traditional situation in which corpus for new scenarios simply does not exist at training time because no such feedback loop was formed, leaving traditional systems as mere question-answering robots limited to one question, one answer.
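The receiving side of this decoupling might look as follows; this is a sketch under the same RabbitMQ assumption as above, with a local directory standing in for the file server and the annotation step left as a placeholder.
```python
# Sketch of the consuming side: take corpus messages off the queue, annotate,
# and push results to the "file server" (a local directory stands in here).
import json
import pathlib

import pika

ANNOTATED_DIR = pathlib.Path("annotated_corpus")  # stand-in for the file server
ANNOTATED_DIR.mkdir(exist_ok=True)

def on_message(channel, method, properties, body):
    record = json.loads(body)
    # ... ASR/NLU annotation (e.g. with Praat / Brat tooling) would go here ...
    out_file = ANNOTATED_DIR / f"{record['AsrSessionID']}.json"
    out_file.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="corpus_annotation", durable=True)
channel.basic_consume(queue="corpus_annotation", on_message_callback=on_message)
channel.start_consuming()
```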
S104. Judge whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, where the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within that time period.
S105. If the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replace the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues; if it does not satisfy the condition, continue training the second preset speech recognition model with new dialogue speech corpus produced by step S101 until it does.
Here, the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within that time period. The preset dialogue completion rate condition concerns whether the proportion of dialogues the second preset speech recognition model completes in self-service within the preset time period meets expectations, for example whether that proportion is greater than or equal to a preset value, or whether it exceeds the proportion of dialogues the previously used speech recognition model completed in self-service over the same time period.
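The metric itself is straightforward; a minimal sketch, assuming each dialogue is recorded as a timestamp plus a flag marking whether self-service completed it:
```python
# Minimal sketch of the dialogue completion rate: dialogues completed by
# speech recognition alone, as a share of all dialogues in the window.
def completion_rate(dialogues, window_start, window_end):
    """dialogues: iterable of (finished_at, completed) pairs."""
    in_window = [(t, done) for t, done in dialogues
                 if window_start <= t < window_end]
    if not in_window:
        return 0.0
    return sum(1 for _, done in in_window if done) / len(in_window)

# e.g. 8 of 10 dialogues self-served within the window gives 0.8
```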
Specifically, after the second preset speech recognition model has been trained for a preset period, for example one month or half a year, on the speech corpus produced in the real scenarios in which users transact business through the dialogue management system, whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition is judged. If it does, the trained second preset speech recognition model replaces the first; when the dialogue management system next receives a user's speech-service request, it calls the second preset speech recognition model to recognize the user's speech, that is, the dialogue management system calls the trained second preset speech recognition model to complete new dialogues. Because the trained second preset speech recognition model uses the speech corpus produced in real time from the real scenarios in which users transact business, it better fits the actual needs of real transactions and adapts better to the real scenarios, improving recognition accuracy during transactions and dialogue quality, and hence raising the dialogue completion rate.
In this embodiment of the application, user speech is received through the dialogue management system and recognized with the first preset speech recognition model to obtain a recognition result; the user speech is answered according to the result to form the interactive dialogue speech corpus; the corpus is sent to the corpus annotation system through message middleware and annotated to obtain the annotated speech corpus; the model training system obtains the annotated corpus and uses it to train the second preset speech recognition model; and whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition is judged, the trained second model replacing the first, if it does, for the dialogue management system to call to complete new dialogues. Since recognition accuracy depends directly on the speech corpus used to train the language model, and since this embodiment couples the dialogue management system, corpus annotation system, and speech recognition model training system so that the real corpus produced by dialogue management is annotated in time and used for real-time training, the language model used in speech recognition is trained promptly on the real corpus of each business scenario, unlike the traditional approach in which the three systems are separated and handled independently; this improves recognition accuracy when training the model and in dialogue management, raising the dialogue completion rate and, in particular, the self-service completion rate of intelligent customer-service robots.
In one embodiment, the dialogue speech corpus includes several speech corpora respectively formed from several dialogues, the speech corpus formed by each dialogue includes the interaction result of that dialogue, and the interaction result includes transfer to a human agent. The step of annotating the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus includes: identifying the speech corpora whose interaction result is transfer to a human agent; removing those speech corpora from the dialogue speech corpus to obtain a filtered dialogue speech corpus; and annotating the filtered dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus.
Specifically, a call is generally transferred to a human agent only when the self-service computer device cannot handle the problem, so every transferred self-service call is a service the computer device could not complete, indicating a business scenario not covered when the speech recognition model was trained: perhaps the semantics of the scenario were misunderstood, or the scenario is unsupported. Speech recognition that did not complete self-service is therefore unsuitable as direct training corpus; only corpus whose interaction result is a successful interaction should be used to train the speech recognition model, which further improves training efficiency and accuracy. Scenarios whose interaction result is a transfer to a human agent require business staff to investigate the cause afterwards and retrain the speech recognition model manually. Whether the interaction was transferred can be judged by assigning a value to a field: for example, if the interaction result is a transfer to a human agent, the field "R" of the interaction result is set to "0"; if it was not transferred and the self-service interaction with the customer succeeded, "R" is set to "1".
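A sketch of this filtering step, assuming each corpus record carries the interaction-result field "R" just described:
```python
# Keep only records from dialogues that succeeded in self-service (R == 1);
# records from dialogues transferred to a human agent (R == 0) are removed.
def filter_training_corpus(records):
    return [r for r in records if r.get("R") == 1]
```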
Please refer to Fig. 3, a schematic sub-flowchart of the dialogue management processing method based on speech recognition provided by an embodiment of this application. As shown in Fig. 3, in this embodiment the step in which the dialogue management system calls the first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete the dialogue, and forms the dialogue into a dialogue speech corpus includes:
S301. Receive the first utterance corresponding to the user speech, and generate a preset dialogue encoding identifier for the dialogue corresponding to the user speech. The preset dialogue encoding identifier may be a dialogue serial number comprising the machine, time, and user elements involved in the dialogue, such as its date and time, the dialogue sequence number, and the number of the self-service machine accessed; generating, in a first preset order, a character string containing these elements yields the preset dialogue encoding identifier of the dialogue.
S302. According to the preset dialogue encoding identifier, call a first preset ASR model to convert the first utterance into user text through the first preset ASR model, and generate, based on the preset dialogue encoding identifier, an ASR dialogue encoding identifier corresponding to this call. The ASR dialogue encoding identifier is a serial number for calls to the ASR model, comprising the preset dialogue encoding identifier, the date and time of the call, and the call count indicating which call this is; generating, in a second preset order, a character string containing these yields the ASR dialogue encoding identifier.
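By way of illustration of steps S301 and S302, the following is a sketch of the two identifiers, assuming the element order described above; the exact field layout is an assumption.
```python
# Sketch of the two serial numbers: a per-dialogue identifier and a
# per-ASR-call identifier derived from it. The field layout is an assumption.
from datetime import datetime

def make_dialogue_id(dialogue_seq_no, machine_no):
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")  # date and time
    return f"{stamp}-{dialogue_seq_no:06d}-{machine_no}"

def make_asr_session_id(dialogue_id, call_count):
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")  # date and time of the call
    return f"{dialogue_id}-{stamp}-{call_count:03d}"
```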
S303. Call a first preset NLU model to understand the user text to obtain the user semantics. S304. According to the user semantics, select the preset answer corresponding to the user semantics from a preset database by a preset semantic matching mode, where semantic matching includes exact semantic matching and fuzzy semantic matching: in exact matching, the preset answers in the database contain semantics identical to the semantics recognized from the user speech; in fuzzy matching, they contain semantics identical or similar to the semantics recognized from the user speech.
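A sketch of step S304's two matching modes follows, with difflib string similarity standing in for fuzzy semantic matching and a hypothetical 0.8 threshold; both stand-ins are assumptions for illustration only.
```python
# Sketch of preset-answer selection: exact semantic match first, then a fuzzy
# fallback. difflib similarity and the 0.8 threshold are illustrative stand-ins.
import difflib

def match_answer(user_semantics, preset_qa):
    if user_semantics in preset_qa:          # exact semantic match
        return preset_qa[user_semantics]
    if not preset_qa:
        return None
    best = max(preset_qa,
               key=lambda k: difflib.SequenceMatcher(None, user_semantics, k).ratio())
    if difflib.SequenceMatcher(None, user_semantics, best).ratio() >= 0.8:
        return preset_qa[best]               # fuzzy semantic match
    return None                              # no usable preset answer
```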
S305. Convert the preset answer into response speech, and send the response speech to the user to respond to the first utterance.
S306. Judge whether the user speech has ended. S307. If the user speech has not ended, receive the second utterance corresponding to the user speech and iterate the step of calling the first preset ASR model according to the preset dialogue encoding identifier until the user speech ends, so as to complete the dialogue and proceed to step S308. S308. If the user speech has ended, complete the dialogue. S309. Form the user speech and the preset answers into a dialogue speech corpus, where the dialogue speech corpus includes the preset dialogue encoding identifier and the ASR dialogue encoding identifiers.
Here, automatic speech recognition (ASR) can be divided into the "traditional" and "end-to-end" approaches, which differ mainly in the acoustic model: the traditional approach generally uses a hidden Markov model (HMM), while the end-to-end approach generally uses a deep neural network (DNN).
Specifically, in a dialogue in which the user transacts business, the user and the self-service speech service interact continuously, for example in question-and-answer form. When the dialogue management system starts to receive the user speech, it receives the first utterance, generates the preset dialogue encoding identifier of the call, which is used to track the call for this transaction, calls the first preset ASR model according to the identifier to convert the first utterance into first user text, and generates the first ASR dialogue encoding identifier describing the ASR call for the first utterance; it then calls the first preset NLU model to understand the first user text and obtain the first user semantics, selects the first preset answer corresponding to those semantics from the preset database by the preset semantic matching mode, converts it into first response speech, and answers the first utterance. It then judges whether the user speech has ended; if not, it receives the second utterance, again calls the first preset ASR model to convert the second utterance into second user text and generates the second ASR dialogue encoding identifier describing the ASR call for the second utterance, calls the first preset NLU model to understand the second user text and obtain the second user semantics, selects the second preset answer from the preset database by the preset semantic matching mode, converts it into second response speech, and answers the second utterance; it again judges whether the user speech has ended, and if not, receives the third utterance and iterates the step of calling the first preset ASR model according to the preset dialogue encoding identifier until the user speech ends and the dialogue is complete. If the user speech has ended, the dialogue is complete, and the utterances of the call together with the preset answer to each utterance form the dialogue speech corpus, which includes the preset dialogue encoding identifier and the ASR dialogue encoding identifiers. For example, when a user places a voice call for self-service, the dialogue management system can generate a UniqueID that marks one whole call; it records everything the user says and every answer the dialogue management system gives, and each call to the ASR model to transcribe the user's speech generates an AsrSessionID marking one ASR interaction. The ASR system passes the transcription to the dialogue management system, which calls the NLU model to understand the text and, according to the understanding, selects the corresponding preset answer from the database to respond, realizing the interaction between the user and the self-service speech computer device, and the dialogue of the interaction forms the dialogue speech corpus. Throughout this process the ASR and NLU models are crucial, and only by setting identifiers for each call and each ASR invocation can the context of a call be linked into one complete interaction, which later lets the ASR and NLU models learn from context and improves the accuracy of speech recognition model training.
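Pulling steps S301-S309 together, the following is an end-to-end sketch of the loop, reusing the identifier and matching helpers sketched above; the asr, nlu, tts, receive_speech, and send_audio functions are stand-ins for the models and services the text describes, not part of the disclosure.
```python
# End-to-end sketch of the dialogue loop (S301-S309). asr, nlu, tts,
# receive_speech and send_audio are stand-ins supplied by the caller.
def run_dialogue(receive_speech, asr, nlu, tts, send_audio, preset_qa):
    dialogue_id = make_dialogue_id(dialogue_seq_no=1, machine_no="robot-01")
    corpus, call_count = [], 0
    speech = receive_speech()                 # first utterance (S301)
    while speech is not None:                 # None: the user speech has ended
        call_count += 1
        asr_session_id = make_asr_session_id(dialogue_id, call_count)
        text = asr(speech)                    # speech -> user text (S302)
        semantics = nlu(text)                 # text -> user semantics (S303)
        answer = match_answer(semantics, preset_qa)   # preset answer (S304)
        send_audio(tts(answer))               # respond to this utterance (S305)
        corpus.append({"UniqueID": dialogue_id,
                       "AsrSessionID": asr_session_id,
                       "text": text, "answer": answer})
        speech = receive_speech()             # next utterance, if any (S306/S307)
    return corpus                             # dialogue speech corpus (S308/S309)
```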
In one embodiment, the step in which the corpus annotation system annotates the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus includes: annotating the dialogue speech corpus in a preset ASR annotation mode to obtain an ASR-annotated speech corpus; and annotating the dialogue speech corpus in a preset NLU annotation mode to obtain an NLU-annotated speech corpus.
The step in which the speech recognition model training system uses the annotated speech corpus to train the second preset speech recognition model includes: the speech recognition model training system obtaining the ASR-annotated speech corpus and the NLU-annotated speech corpus; and training a second preset ASR model with the ASR-annotated speech corpus and a second preset NLU model with the NLU-annotated speech corpus.
Here, the ASR model is the acoustic model, which integrates acoustic and phonetic knowledge to convert sound into text. When recognizing speech to convert sound into text, since spoken audio is continuous, the computer device does not know which part of the speech corresponds to which phoneme or word; the dialogue speech corpus must first be annotated in the ASR annotation mode so that the speech can be automatically segmented into phonemes or words, which are then converted into text, realizing speech-to-text conversion through the ASR model. The ASR annotation mode annotates the dialogue speech corpus at the speech level and can be implemented with speech annotation tools, including the Praat annotation mode of the Praat tool and the Transcriber annotation mode of the Transcriber tool.
The NLU model is the language model, which learns the relationships between words from the training corpus to estimate the likelihood of a hypothesized word sequence, also called the language model score; it captures how characters form words and words form sentences, expressing the relationships among the characters and words that convey the content of the language, and usually yields a more accurate estimate of the language. Language model toolkits include SRILM, IRSTLM, MITLM, and BerkeleyLM. To convert text into words and sentences with their usual meanings through the NLU model, the obtained text must be annotated, so that the annotated text can be composed by the NLU model into meaningful words and sentences. Therefore, after the speech is converted into text, the text is annotated in the preset NLU annotation mode to obtain the NLU-annotated speech corpus, which the NLU model converts into words and sentences with meaningful content, finally turning speech into ordinarily used written language. The NLU annotation mode that annotates the text contained in the ASR-annotated speech corpus can use corpus annotation tools, including the Brat annotation mode of the Brat tool, the Parker annotation mode of the Parker tool, the YEDDA annotation mode of the YEDDA tool, the Snorkel annotation mode of the Snorkel tool, and the Prodigy annotation mode of the Prodigy tool.
Specifically, in this implementation, user speech is received through the dialogue management system and recognized with the first ASR model and the first NLU model to obtain the recognition result, and a response is made according to the result to form the dialogue speech corpus; the corpus is sent to the corpus annotation system, which receives it, annotates it in the preset ASR annotation mode to obtain the ASR-annotated speech corpus and in the preset NLU annotation mode to obtain the NLU-annotated speech corpus, and sends them to the file server; the model training system fetches the annotated corpora from the file server and uses the ASR-annotated corpus and the NLU-annotated corpus to train the second ASR model and the second NLU model respectively, so that real corpus trains the ASR and NLU models in real time. Since recognition accuracy depends heavily on the speech corpus the language model uses, training each business's ASR model on the real speech corpus produced in its own business scenario raises that ASR model's recognition accuracy; higher ASR accuracy in turn improves the NLU model's understanding accuracy, ultimately raising overall recognition accuracy and, finally, the self-service completion rate of intelligent customer-service robots.
In one embodiment, before the step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, the method further includes: the dialogue management system calling the second preset ASR model and the second preset NLU model to recognize newly received user speech and responding to the new user speech to complete dialogues; counting a first completion rate of dialogues completed within a preset time period by the first preset ASR model and the first preset NLU model recognizing user speech; and counting a second completion rate of dialogues completed within the preset time period by the second preset ASR model and the second preset NLU model recognizing the new user speech.
The step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition then includes: judging whether the second completion rate is greater than the first completion rate; and, if it is, determining that the trained second preset speech recognition model satisfies the preset dialogue completion rate condition.
Specifically, for the first preset ASR and NLU models and the second preset ASR and NLU models within the preset time period, the completion rates over the same period are counted separately, and whether the second completion rate is greater than the first is judged; if it is, the trained second preset speech recognition model is determined to satisfy the preset dialogue completion rate condition, the second preset ASR model replaces the first preset ASR model, and the second preset NLU model replaces the first preset NLU model. For the self-service speech services the dialogue management system provides within the preset time period, the completion rate of each kind of service is counted, that is, dialogues whose interaction result is success without a transfer to a human agent. If the self-service completion rate improves after the trained second preset ASR and NLU models are adopted, the new models corresponding to them are used; otherwise, the old models corresponding to the first preset ASR and NLU models remain in use, and the second preset ASR and NLU models continue to be trained. For example, the dialogue management system may count the completion rate of each kind of self-service speech service every month; if the completion rate improves after adopting the trained second preset speech recognition model, that model is used, and otherwise the first preset speech recognition model continues to complete users' transaction dialogues while the second continues to be trained.
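A sketch of this promotion decision, reusing completion_rate() from the earlier sketch; the two inputs are assumed to be the dialogue logs produced under the first and second preset models over the same time window.
```python
# Sketch: promote the second preset model only if its completion rate over
# the same window beats the first preset model's.
def maybe_promote(first_model_dialogues, second_model_dialogues, window):
    start, end = window
    first = completion_rate(first_model_dialogues, start, end)
    second = completion_rate(second_model_dialogues, start, end)
    return "promote" if second > first else "keep-training"
```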
Since speech recognition accuracy depends heavily on the language model used, and language model training depends directly on the speech recognition corpus adopted, the embodiments of this application use, for each business, the real speech recognition corpus corresponding to that business, thereby training distinct ASR neural network language models; higher recognition accuracy also improves the NLU model's understanding accuracy, making it possible to build business-specific corpus annotation and model training systems. For customer-service robots in particular, building such business-specific systems to raise the accuracy and efficiency of speech recognition is especially important: only then can recognition be accurate across all kinds of professional services, ultimately raising the self-service completion rate of intelligent customer-service robots.
It should be noted that, in the dialogue management processing methods based on speech recognition described in the above embodiments, the technical features contained in different embodiments can be recombined as needed to obtain combined implementations, all of which fall within the protection scope claimed by this application.
Please refer to Fig. 4, a schematic block diagram of the dialogue management processing apparatus based on speech recognition provided by an embodiment of this application. Corresponding to the method described above, this embodiment also provides such an apparatus. As shown in Fig. 4, the apparatus includes units for executing the method described above and can be configured in a computer device. Specifically, the dialogue management processing apparatus 400 based on speech recognition includes a dialogue unit 401, an annotation unit 402, a training unit 403, a judgment unit 404, and a replacement unit 405. The dialogue unit 401 is configured to receive user speech through the dialogue management system, so that the dialogue management system calls the first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the result to complete a dialogue, and forms the dialogue into a dialogue speech corpus; the annotation unit 402 is configured to send the dialogue speech corpus to the corpus annotation system through the first preset message middleware, so that the corpus annotation system annotates it with the preset speech corpus annotation tool to obtain an annotated speech corpus; the training unit 403 is configured to send the annotated speech corpus to the speech recognition model training system through the second preset message middleware, so that the training system uses the annotated speech corpus to train the second preset speech recognition model; the judgment unit 404 is configured to judge whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, where the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within that time period; and the replacement unit 405 is configured to replace the first preset speech recognition model with the trained second preset speech recognition model, if it satisfies the condition, for the dialogue management system to call to complete new dialogues.
In one embodiment, the dialogue speech corpus includes several speech corpora respectively formed from several dialogues, the speech corpus of each dialogue includes its interaction result, and the interaction result includes transfer to a human agent; the annotation unit 402 includes an identification subunit configured to identify the speech corpora whose interaction result is transfer to a human agent, a removal subunit configured to remove those corpora from the dialogue speech corpus to obtain a filtered dialogue speech corpus, and an annotation subunit configured to annotate the filtered corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus.
In one embodiment, the dialogue unit 401 includes: a first receiving subunit configured to receive the first utterance corresponding to the user speech and generate the preset dialogue encoding identifier of the corresponding dialogue; a first calling subunit configured to call the first preset ASR model according to the identifier to convert the first utterance into user text through the first preset ASR model and to generate, based on the identifier, the ASR dialogue encoding identifier of that call; a second calling subunit configured to call the first preset NLU model to understand the user text to obtain the user semantics; a selection subunit configured to select, by the preset semantic matching mode, the preset answer corresponding to the user semantics from the preset database; a response subunit configured to convert the preset answer into response speech and send it to the user to respond to the first utterance; a first judgment subunit configured to judge whether the user speech has ended; a second receiving subunit configured to, if the user speech has not ended, receive the second utterance and iterate the step of calling the first preset ASR model according to the identifier until the user speech ends, so as to complete the dialogue; and a dialogue-forming subunit configured to, if the user speech has ended, complete the dialogue and form the user speech and the preset answers into the dialogue speech corpus, which includes the preset dialogue encoding identifier and the ASR dialogue encoding identifiers.
In one embodiment, the annotation unit 402 includes a first annotation subunit configured to annotate the dialogue speech corpus in the preset ASR annotation mode to obtain the ASR-annotated speech corpus, and a second annotation subunit configured to annotate the dialogue speech corpus in the preset NLU annotation mode to obtain the NLU-annotated speech corpus;
the training unit 403 includes an acquisition subunit through which the speech recognition model training system obtains the ASR-annotated and NLU-annotated speech corpora, and a training subunit configured to train the second preset ASR model with the ASR-annotated corpus and the second preset NLU model with the NLU-annotated corpus.
In one embodiment, the apparatus 400 further includes: a calling unit through which the dialogue management system calls the second preset ASR and NLU models to recognize newly received user speech and respond to it to complete dialogues; a first counting unit configured to count the first completion rate of dialogues completed within the preset time period by the first preset ASR and NLU models recognizing user speech; and a second counting unit configured to count the second completion rate of dialogues completed within the preset time period by the second preset ASR and NLU models recognizing the new user speech;
the judgment unit 404 includes a second judgment subunit configured to judge whether the second completion rate is greater than the first, and a determination subunit configured to determine, if it is, that the trained second preset speech recognition model satisfies the preset dialogue completion rate condition.
It should be noted that those skilled in the art can clearly understand that the specific implementation of the above apparatus and its units may refer to the corresponding descriptions in the preceding method embodiments; for convenience and brevity of description, they are not repeated here.
Meanwhile, the division and connection of the units in the above apparatus are only illustrative; in other embodiments, the apparatus may be divided into different units as needed, or its units may be connected in different orders and manners, to complete all or part of the functions of the apparatus.
The above dialogue management processing apparatus based on speech recognition may be implemented in the form of a computer program, which can run on a computer device as shown in Fig. 5.
Please refer to Fig. 5, a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 500 may be a desktop computer, a server, or another computer device, or a component or part of another device.
Referring to Fig. 5, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504; the memory may also be a volatile storage medium.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 can cause the processor 502 to perform the dialogue management processing method based on speech recognition described above.
The processor 502 provides computing and control capabilities to support the operation of the whole computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when executed by the processor 502, the computer program 5032 can cause the processor 502 to perform the dialogue management processing method based on speech recognition described above.
The network interface 505 is used for network communication with other devices. Those skilled in the art can understand that the structure shown in Fig. 5 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device 500 to which the solution is applied; the specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor, whose structures and functions are consistent with the embodiment shown in Fig. 5 and are not repeated here.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the dialogue management processing method based on speech recognition described in the embodiments of this application.
It should be understood that, in the embodiments of this application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on; the general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments can be completed by a computer program, which can be stored in a computer-readable storage medium and executed by at least one processor in the computer system to implement the steps of the above method embodiments.
Therefore, this application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium; it stores a computer program that, when executed by a processor, causes the processor to execute the steps of the dialogue management processing method based on speech recognition described in the above embodiments.
The storage medium is a physical, non-transitory storage medium, for example a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other physical medium that can store a computer program.
Those of ordinary skill in the art can realize that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of the examples have been described above generally by function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution; skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of this application.
The above are only specific embodiments of this application, but the protection scope of this application is not limited to them; any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed by this application, and such modifications or replacements shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A dialogue management processing method based on speech recognition, comprising:
    receiving user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus;
    sending the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus;
    sending the annotated speech corpus to a speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model;
    judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, wherein the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within the preset time period; and
    if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  2. The dialogue management processing method based on speech recognition according to claim 1, wherein the dialogue speech corpus comprises several speech corpora respectively formed from several dialogues, the speech corpus formed from each dialogue comprises the interaction result of that dialogue, the interaction result includes transfer to a human agent, and the step of annotating the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus comprises:
    identifying the speech corpora whose interaction result is transfer to a human agent;
    removing the speech corpora whose interaction result is transfer to a human agent from the dialogue speech corpus to obtain a filtered dialogue speech corpus; and
    annotating the filtered dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus.
  3. The dialogue management processing method based on speech recognition according to claim 1, wherein the step in which the dialogue management system calls the first preset speech recognition model to recognize the user speech to obtain the recognition result, responds to the user speech according to the recognition result to complete the dialogue, and forms the dialogue into the dialogue speech corpus comprises:
    receiving a first utterance corresponding to the user speech, and generating a preset dialogue encoding identifier of the dialogue corresponding to the user speech;
    calling, according to the preset dialogue encoding identifier, a first preset ASR model to convert the first utterance into user text through the first preset ASR model, and generating, based on the preset dialogue encoding identifier, an ASR dialogue encoding identifier corresponding to this call;
    calling a first preset NLU model to understand the user text to obtain user semantics;
    selecting, according to the user semantics, a preset answer corresponding to the user semantics from a preset database by a preset semantic matching mode;
    converting the preset answer into response speech, and sending the response speech to the user to respond to the first utterance;
    judging whether the user speech has ended;
    if the user speech has not ended, receiving a second utterance corresponding to the user speech, and iterating the step of calling the first preset ASR model according to the preset dialogue encoding identifier until the user speech ends, so as to complete the dialogue; and
    if the user speech has ended, completing the dialogue, and forming the user speech and the preset answers into the dialogue speech corpus, wherein the dialogue speech corpus includes the preset dialogue encoding identifier and the ASR dialogue encoding identifier.
  4. The dialogue management processing method based on speech recognition according to claim 3, wherein the step in which the corpus annotation system annotates the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus comprises:
    annotating the dialogue speech corpus in a preset ASR annotation mode to obtain an ASR-annotated speech corpus; and
    annotating the dialogue speech corpus in a preset NLU annotation mode to obtain an NLU-annotated speech corpus;
    and the step in which the speech recognition model training system uses the annotated speech corpus to train the second preset speech recognition model comprises:
    the speech recognition model training system obtaining the ASR-annotated speech corpus and the NLU-annotated speech corpus; and
    training a second preset ASR model with the ASR-annotated speech corpus, and training a second preset NLU model with the NLU-annotated speech corpus.
  5. The dialogue management processing method based on speech recognition according to claim 4, further comprising, before the step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition:
    the dialogue management system calling the second preset ASR model and the second preset NLU model to recognize received new user speech, and responding to the new user speech to complete dialogues;
    counting a first completion rate of dialogues completed within a preset time period by the first preset ASR model and the first preset NLU model recognizing the user speech; and
    counting a second completion rate of dialogues completed within the preset time period by the second preset ASR model and the second preset NLU model recognizing the new user speech;
    wherein the step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition comprises:
    judging whether the second completion rate is greater than the first completion rate; and
    if the second completion rate is greater than the first completion rate, determining that the trained second preset speech recognition model satisfies the preset dialogue completion rate condition.
  6. The dialogue management processing method based on speech recognition according to claim 4, wherein the preset ASR annotation mode is the Praat annotation mode, and the preset NLU annotation mode is the Brat corpus annotation mode.
  7. The dialogue management processing method based on speech recognition according to claim 1, wherein the step of sending the dialogue speech corpus to the corpus annotation system through the first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus, comprises:
    sending the dialogue speech corpus to the corpus annotation system through the first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with the preset speech corpus annotation tool to obtain an initially annotated speech corpus; and
    receiving operations performed on the initially annotated speech corpus to obtain the annotated speech corpus, wherein the operations include revision and confirmation.
  8. A dialogue management processing apparatus based on speech recognition, comprising:
    a dialogue unit configured to receive user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus;
    an annotation unit configured to send the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus;
    a training unit configured to send the annotated speech corpus to a speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model;
    a judgment unit configured to judge whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, wherein the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within the preset time period; and
    a replacement unit configured to, if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replace the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  9. A computer device, comprising a memory and a processor connected to the memory, the memory being configured to store a computer program, and the processor being configured to run the computer program to perform the following steps:
    receiving user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus;
    sending the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus;
    sending the annotated speech corpus to a speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model;
    judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, wherein the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within the preset time period; and
    if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  10. The computer device according to claim 9, wherein the dialogue speech corpus comprises several speech corpora respectively formed from several dialogues, the speech corpus formed from each dialogue comprises the interaction result of that dialogue, the interaction result includes transfer to a human agent, and the step of annotating the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus comprises:
    identifying the speech corpora whose interaction result is transfer to a human agent;
    removing the speech corpora whose interaction result is transfer to a human agent from the dialogue speech corpus to obtain a filtered dialogue speech corpus; and
    annotating the filtered dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus.
  11. The computer device according to claim 9, wherein the step in which the dialogue management system calls the first preset speech recognition model to recognize the user speech to obtain the recognition result, responds to the user speech according to the recognition result to complete the dialogue, and forms the dialogue into the dialogue speech corpus comprises:
    receiving a first utterance corresponding to the user speech, and generating a preset dialogue encoding identifier of the dialogue corresponding to the user speech;
    calling, according to the preset dialogue encoding identifier, a first preset ASR model to convert the first utterance into user text through the first preset ASR model, and generating, based on the preset dialogue encoding identifier, an ASR dialogue encoding identifier corresponding to this call;
    calling a first preset NLU model to understand the user text to obtain user semantics;
    selecting, according to the user semantics, a preset answer corresponding to the user semantics from a preset database by a preset semantic matching mode;
    converting the preset answer into response speech, and sending the response speech to the user to respond to the first utterance;
    judging whether the user speech has ended;
    if the user speech has not ended, receiving a second utterance corresponding to the user speech, and iterating the step of calling the first preset ASR model according to the preset dialogue encoding identifier until the user speech ends, so as to complete the dialogue; and
    if the user speech has ended, completing the dialogue, and forming the user speech and the preset answers into the dialogue speech corpus, wherein the dialogue speech corpus includes the preset dialogue encoding identifier and the ASR dialogue encoding identifier.
  12. The computer device according to claim 11, wherein the step in which the corpus annotation system annotates the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus comprises:
    annotating the dialogue speech corpus in a preset ASR annotation mode to obtain an ASR-annotated speech corpus; and
    annotating the dialogue speech corpus in a preset NLU annotation mode to obtain an NLU-annotated speech corpus;
    and the step in which the speech recognition model training system uses the annotated speech corpus to train the second preset speech recognition model comprises:
    the speech recognition model training system obtaining the ASR-annotated speech corpus and the NLU-annotated speech corpus; and
    training a second preset ASR model with the ASR-annotated speech corpus, and training a second preset NLU model with the NLU-annotated speech corpus.
  13. The computer device according to claim 12, wherein, before the step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, the steps further comprise:
    the dialogue management system calling the second preset ASR model and the second preset NLU model to recognize received new user speech, and responding to the new user speech to complete dialogues;
    counting a first completion rate of dialogues completed within a preset time period by the first preset ASR model and the first preset NLU model recognizing the user speech; and
    counting a second completion rate of dialogues completed within the preset time period by the second preset ASR model and the second preset NLU model recognizing the new user speech;
    wherein the step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition comprises:
    judging whether the second completion rate is greater than the first completion rate; and
    if the second completion rate is greater than the first completion rate, determining that the trained second preset speech recognition model satisfies the preset dialogue completion rate condition.
  14. The computer device according to claim 12, wherein the preset ASR annotation mode is the Praat annotation mode, and the preset NLU annotation mode is the Brat corpus annotation mode.
  15. The computer device according to claim 9, wherein the step of sending the dialogue speech corpus to the corpus annotation system through the first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus, comprises:
    sending the dialogue speech corpus to the corpus annotation system through the first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with the preset speech corpus annotation tool to obtain an initially annotated speech corpus; and
    receiving operations performed on the initially annotated speech corpus to obtain the annotated speech corpus, wherein the operations include revision and confirmation.
  16. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
    receiving user speech through a dialogue management system, so that the dialogue management system calls a first preset speech recognition model to recognize the user speech to obtain a recognition result, responds to the user speech according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue speech corpus;
    sending the dialogue speech corpus to a corpus annotation system through first preset message middleware, so that the corpus annotation system annotates the dialogue speech corpus with a preset speech corpus annotation tool to obtain an annotated speech corpus;
    sending the annotated speech corpus to a speech recognition model training system through second preset message middleware, so that the speech recognition model training system uses the annotated speech corpus to train a second preset speech recognition model;
    judging whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition, wherein the dialogue completion rate is the proportion of dialogues completed on the basis of speech recognition within a preset time period to all dialogues within the preset time period; and
    if the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, replacing the first preset speech recognition model with the trained second preset speech recognition model for the dialogue management system to call to complete new dialogues.
  17. The computer-readable storage medium according to claim 16, wherein the dialogue speech corpus comprises several speech corpora respectively formed from several dialogues, the speech corpus formed from each dialogue comprises the interaction result of that dialogue, the interaction result includes transfer to a human agent, and the step of annotating the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus comprises:
    identifying the speech corpora whose interaction result is transfer to a human agent;
    removing the speech corpora whose interaction result is transfer to a human agent from the dialogue speech corpus to obtain a filtered dialogue speech corpus; and
    annotating the filtered dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus.
  18. The computer-readable storage medium according to claim 16, wherein the step in which the dialogue management system calls the first preset speech recognition model to recognize the user speech to obtain the recognition result, responds to the user speech according to the recognition result to complete the dialogue, and forms the dialogue into the dialogue speech corpus comprises:
    receiving a first utterance corresponding to the user speech, and generating a preset dialogue encoding identifier of the dialogue corresponding to the user speech;
    calling, according to the preset dialogue encoding identifier, a first preset ASR model to convert the first utterance into user text through the first preset ASR model, and generating, based on the preset dialogue encoding identifier, an ASR dialogue encoding identifier corresponding to this call;
    calling a first preset NLU model to understand the user text to obtain user semantics;
    selecting, according to the user semantics, a preset answer corresponding to the user semantics from a preset database by a preset semantic matching mode;
    converting the preset answer into response speech, and sending the response speech to the user to respond to the first utterance;
    judging whether the user speech has ended;
    if the user speech has not ended, receiving a second utterance corresponding to the user speech, and iterating the step of calling the first preset ASR model according to the preset dialogue encoding identifier until the user speech ends, so as to complete the dialogue; and
    if the user speech has ended, completing the dialogue, and forming the user speech and the preset answers into the dialogue speech corpus, wherein the dialogue speech corpus includes the preset dialogue encoding identifier and the ASR dialogue encoding identifier.
  19. The computer-readable storage medium according to claim 18, wherein the step in which the corpus annotation system annotates the dialogue speech corpus with the preset speech corpus annotation tool to obtain the annotated speech corpus comprises:
    annotating the dialogue speech corpus in a preset ASR annotation mode to obtain an ASR-annotated speech corpus; and
    annotating the dialogue speech corpus in a preset NLU annotation mode to obtain an NLU-annotated speech corpus;
    and the step in which the speech recognition model training system uses the annotated speech corpus to train the second preset speech recognition model comprises:
    the speech recognition model training system obtaining the ASR-annotated speech corpus and the NLU-annotated speech corpus; and
    training a second preset ASR model with the ASR-annotated speech corpus, and training a second preset NLU model with the NLU-annotated speech corpus.
  20. The computer-readable storage medium according to claim 19, wherein, before the step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition, the steps further comprise:
    the dialogue management system calling the second preset ASR model and the second preset NLU model to recognize received new user speech, and responding to the new user speech to complete dialogues;
    counting a first completion rate of dialogues completed within a preset time period by the first preset ASR model and the first preset NLU model recognizing the user speech; and
    counting a second completion rate of dialogues completed within the preset time period by the second preset ASR model and the second preset NLU model recognizing the new user speech;
    wherein the step of judging whether the trained second preset speech recognition model satisfies the preset dialogue completion rate condition comprises:
    judging whether the second completion rate is greater than the first completion rate; and
    if the second completion rate is greater than the first completion rate, determining that the trained second preset speech recognition model satisfies the preset dialogue completion rate condition.
PCT/CN2020/122422 2020-06-16 2020-10-21 Dialogue management processing method, apparatus, device and medium based on speech recognition WO2021135534A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010550379.9 2020-06-16
CN202010550379.9A CN111739519A (zh) 2020-06-16 2020-06-16 Dialogue management processing method, apparatus, device and medium based on speech recognition

Publications (1)

Publication Number Publication Date
WO2021135534A1 true WO2021135534A1 (zh) 2021-07-08

Family

ID=72649914

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122422 WO2021135534A1 (zh) 2020-06-16 2020-10-21 Dialogue management processing method, apparatus, device and medium based on speech recognition

Country Status (2)

Country Link
CN (1) CN111739519A (zh)
WO (1) WO2021135534A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739519A (zh) 2020-06-16 2020-10-02 平安科技(深圳)有限公司 Dialogue management processing method, apparatus, device and medium based on speech recognition
CN112347768B (zh) * 2020-10-12 2023-06-27 出门问问(苏州)信息科技有限公司 Entity recognition method and apparatus
CN112233665A (zh) * 2020-10-16 2021-01-15 珠海格力电器股份有限公司 Model training method and apparatus, electronic device, and storage medium
CN112653798A (zh) * 2020-12-22 2021-04-13 平安普惠企业管理有限公司 Intelligent customer-service voice response method and apparatus, computer device, and storage medium
CN112837683B (zh) * 2020-12-31 2022-07-26 思必驰科技股份有限公司 Voice service method and apparatus
CN113608664A (zh) * 2021-07-26 2021-11-05 京东科技控股股份有限公司 Intelligent voice robot interaction effect optimization method and apparatus, and intelligent robot
CN114441029A (zh) * 2022-01-20 2022-05-06 深圳壹账通科技服务有限公司 Recording noise detection method, apparatus, device and medium for a voice annotation system
CN116108373A (zh) * 2023-04-17 2023-05-12 京东科技信息技术有限公司 Call-record data classification and annotation system, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190237061A1 (en) * 2018-01-31 2019-08-01 Semantic Machines, Inc. Training natural language system with generated dialogues
CN110120221A (zh) * 2019-06-06 2019-08-13 上海蔚来汽车有限公司 User-personalized offline speech recognition method for an in-vehicle system, and system thereof
CN110377911A (zh) * 2019-07-23 2019-10-25 中国工商银行股份有限公司 Intention recognition method and apparatus under a dialogue framework
CN110543552A (zh) * 2019-09-06 2019-12-06 网易(杭州)网络有限公司 Dialogue interaction method and apparatus, and electronic device
CN110765270A (zh) * 2019-11-04 2020-02-07 苏州思必驰信息科技有限公司 Training method and system for a text classification model for spoken interaction
CN111739519A (zh) * 2020-06-16 2020-10-02 平安科技(深圳)有限公司 Dialogue management processing method, apparatus, device and medium based on speech recognition

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103000052A (zh) * 2011-09-16 2013-03-27 上海先先信息科技有限公司 Human-machine interactive spoken dialogue system and implementation method thereof
CN107945792B (zh) * 2017-11-06 2021-05-28 百度在线网络技术(北京)有限公司 Speech processing method and apparatus
CN110059170B (zh) * 2019-03-21 2022-04-26 北京邮电大学 Multi-round dialogue online training method and system based on user interaction
CN110263322B (zh) * 2019-05-06 2023-09-05 平安科技(深圳)有限公司 Audio corpus screening method and apparatus for speech recognition, and computer device
CN110265001B (zh) * 2019-05-06 2023-06-23 平安科技(深圳)有限公司 Corpus screening method and apparatus for speech recognition training, and computer device
CN110503143B (zh) * 2019-08-14 2024-03-19 平安科技(深圳)有限公司 Threshold selection method, device, storage medium and apparatus based on intention recognition
CN111143535B (zh) * 2019-12-27 2021-08-10 北京百度网讯科技有限公司 Method and apparatus for generating a dialogue model


Also Published As

Publication number Publication date
CN111739519A (zh) 2020-10-02

Similar Documents

Publication Publication Date Title
WO2021135534A1 (zh) Dialogue management processing method, apparatus, device and medium based on speech recognition
CA2576605C (en) Natural language classification within an automated response system
CN111212190B (zh) Dialogue management method, apparatus and system based on script strategy management
EP1602102B1 (en) Management of conversations
US8914294B2 (en) System and method of providing an automated data-collection in spoken dialog systems
US8515736B1 (en) Training call routing applications by reusing semantically-labeled data collected for prior applications
US6519562B1 (en) Dynamic semantic control of a speech recognition system
US7907705B1 (en) Speech to text for assisted form completion
US8165887B2 (en) Data-driven voice user interface
CN110853649A (zh) Label extraction method, system, device and medium based on intelligent speech technology
CN108899013A (zh) Voice search method and apparatus, and speech recognition system
CN117149977A (zh) Intelligent debt-collection robot based on robotic process automation
CN115022471B (zh) Intelligent robot voice interaction system and method
CN114238605B (zh) Automatic dialogue method and apparatus for an intelligent voice customer-service robot
CN117636877B (zh) Intelligent system operation method and system based on voice instructions
Di Fabbrizio et al. Bootstrapping spoken dialogue systems by exploiting reusable libraries
KR101002135B1 (ko) Method for delivering the speech recognition result of a syllable speech recognizer
KR20230115723A (ko) Real-time consultation evaluation system
CN113889112A (zh) Online speech recognition method based on Kaldi
CN117829819A (zh) Fault handling method, device, and computer-readable storage medium
CN115640386A (zh) Method and device for conducting dialogue based on recommended scripts
CN115602172A (zh) Intelligent outbound call method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20908668

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20908668

Country of ref document: EP

Kind code of ref document: A1