CN111739519A - Dialogue management processing method, device, equipment and medium based on voice recognition - Google Patents

Dialogue management processing method, device, equipment and medium based on voice recognition

Info

Publication number
CN111739519A
Authority
CN
China
Prior art keywords
voice
preset
dialogue
corpus
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010550379.9A
Other languages
Chinese (zh)
Inventor
叶怡周
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010550379.9A
Publication of CN111739519A
Priority to PCT/CN2020/122422 (WO2021135534A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/225 Feedback of the input speech

Abstract

The embodiment of the application provides a dialogue management processing method and device based on voice recognition, computer equipment and a computer readable storage medium, relating to the field of artificial intelligence. A dialogue management system receives user voice, recognizes it with a first preset speech recognition model, and responds to the user voice according to the recognition result, forming a dialogue voice corpus. The dialogue voice corpus is sent to a corpus tagging system, which tags it to obtain a tagged voice corpus. A model training system obtains the tagged voice corpus and uses it to train a second preset speech recognition model, and whether the trained second preset speech recognition model satisfies a preset dialogue completion rate condition is judged. If it does, the trained second preset speech recognition model is adopted and called by the dialogue management system to complete new dialogues, thereby improving the dialogue completion rate.

Description

Dialogue management processing method, device, equipment and medium based on voice recognition
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing dialog management based on speech recognition, a computer device, and a computer-readable storage medium.
Background
With the development of voice recognition technology, and especially its application to self-service robots, higher requirements are placed on recognition accuracy; customer service robots in particular must recognize user speech accurately within the professional services of each application scene.
Speech recognition is typically performed by a speech recognition model such as an ASR (Automatic Speech Recognition) model. The recognition rate of an ASR model is mainly determined by its acoustic model and its language model: after the acoustic model generates candidate word sequences, the language model selects the word sequence that best matches a normal utterance as the final transcription. In the traditional technology, the ASR model is trained on a pre-accumulated corpus of training samples and is put into the production environment only after the development and test environments verify that it meets requirements. Because the model is trained on this limited, pre-accumulated corpus while the queries users raise in a real production environment vary widely, the training corpus cannot cover all user queries, and the ASR model cannot accurately recognize speech that its training never covered. Hence, even when a traditionally trained ASR model meets its training targets, it still answers inaccurately in production owing to low speech recognition accuracy, and it must be retrained periodically on corpora generated in the production environment before it can be updated.
Because the ASR model in the traditional technology cannot be updated in time, the training period is invisibly prolonged and training efficiency falls. With low training efficiency, the dialogue completion rate of the ASR model in the production environment cannot be improved promptly (the completion rate rises as speech recognition accuracy improves), which degrades the self-service quality of the various robots.
Disclosure of Invention
The embodiment of the application provides a dialogue management processing method and device based on speech recognition, computer equipment and a computer readable storage medium, which can solve the problem in the traditional technology that low ASR model training efficiency degrades the self-service quality of various robots.
In a first aspect, an embodiment of the present application provides a dialog management processing method based on speech recognition, where the method includes: receiving user voice through a dialogue management system, so that the dialogue management system calls a first preset voice recognition model to recognize the user voice and obtain a recognition result, responds to the user voice according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue voice corpus; sending the dialogue voice corpus to a corpus tagging system through a first preset message middleware, so that the corpus tagging system tags the dialogue voice corpus through a preset voice corpus tagging tool to obtain a tagged voice corpus; sending the tagged voice corpus to a voice recognition model training system through a second preset message middleware, so that the voice recognition model training system trains a second preset voice recognition model using the tagged voice corpus; judging whether the trained second preset voice recognition model meets a preset dialogue completion rate condition, wherein the dialogue completion rate is the proportion of the number of dialogues completed based on voice recognition within a preset time period to the total number of dialogues in that period; and if the trained second preset voice recognition model meets the preset dialogue completion rate condition, replacing the first preset voice recognition model with the trained second preset voice recognition model so that the dialogue management system can call it to complete new dialogues.
In a second aspect, an embodiment of the present application further provides a dialog management processing apparatus based on speech recognition, including: a dialogue unit, configured to receive user voice through a dialogue management system, so that the dialogue management system calls a first preset voice recognition model to recognize the user voice and obtain a recognition result, responds to the user voice according to the recognition result to complete a dialogue, and forms the dialogue into a dialogue voice corpus; a labeling unit, configured to send the dialogue voice corpus to a corpus labeling system through a first preset message middleware, so that the corpus labeling system labels the dialogue voice corpus through a preset voice corpus labeling tool to obtain a labeled voice corpus; a training unit, configured to send the labeled voice corpus to a voice recognition model training system through a second preset message middleware, so that the voice recognition model training system trains a second preset voice recognition model using the labeled voice corpus; a judging unit, configured to judge whether the trained second preset voice recognition model meets a preset dialogue completion rate condition, wherein the dialogue completion rate is the proportion of the number of dialogues completed based on voice recognition within a preset time period to the total number of dialogues in that period; and a replacing unit, configured to replace the first preset voice recognition model with the trained second preset voice recognition model, to be called by the dialogue management system to complete new dialogues, if the trained second preset voice recognition model meets the preset dialogue completion rate condition.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the dialog management processing method based on speech recognition when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the processor is caused to execute the steps of the dialog management processing method based on speech recognition.
The embodiment of the application provides a dialogue management processing method and device based on voice recognition, computer equipment and a computer readable storage medium. User voice is received through a dialogue management system; a first preset voice recognition model recognizes the user voice to obtain a recognition result, and the user voice is responded to according to that result, forming an interactive dialogue voice corpus. The dialogue voice corpus is sent through message middleware to a corpus tagging system, which tags it to obtain a tagged voice corpus; a model training system obtains the tagged voice corpus and uses it to train a second preset voice recognition model. Whether the trained second preset voice recognition model satisfies a preset dialogue completion rate condition is then judged; if it does, it replaces the first preset voice recognition model and is called by the dialogue management system to complete new dialogues. Speech recognition accuracy is directly related to the voice corpus used to train the speech model. In the embodiment of the application, the dialogue management system, the corpus tagging system and the speech recognition model training system are coupled, so the real voice corpora generated by the dialogue management system can be sent to the corpus tagging system for tagging in a timely manner, and the model training system can train the second preset speech recognition model on the tagged real corpora in real time. Compared with the traditional approach, in which the dialogue management system, the corpus tagging system and the speech recognition model training system are split and handled separately, the embodiment of the application trains the speech model promptly on the real voice corpora generated in each service scene, improving recognition accuracy during training of the speech recognition model, the accuracy of speech recognition in dialogue management, and the dialogue completion rate, in particular the self-service completion rate of the intelligent customer service robot.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a dialog management processing method based on speech recognition according to an embodiment of the present application;
fig. 2 is a schematic diagram of a dialog management processing method based on speech recognition according to an embodiment of the present application;
fig. 3 is a schematic sub-flow chart of a dialog management processing method based on speech recognition according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a dialog management processing apparatus based on speech recognition according to an embodiment of the present application; and
fig. 5 is a schematic block diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a dialog management processing method based on speech recognition according to an embodiment of the present application. As shown in fig. 1, the method comprises the following steps S101-S105:
S101, receiving user voice through a dialogue management system, so that the dialogue management system calls a first preset voice recognition model to recognize the user voice and obtain a recognition result, responds to the user voice according to the recognition result to complete the dialogue, and forms the dialogue into a dialogue voice corpus.
In particular, the speech recognition model training system needs real contextual corpora to improve training accuracy, and the dialogue management system can provide real dialogues from the scene corresponding to the business the user transacts. User voice is received through the dialogue management system, which invokes a first preset speech recognition model to recognize it and obtain a recognition result; for example, the dialogue management system calls a first ASR (Automatic Speech Recognition) model and a first NLU (Natural Language Understanding) model (both currently in use) to obtain the recognition result corresponding to the user voice, and responds to the user voice according to that result to complete the dialogue. Interaction between the user and the intelligent voice computer equipment is thereby realized, and the dialogue finally forms a dialogue voice corpus.
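The turn handling just described can be illustrated with a minimal Python sketch. The class names (AsrModel, NluModel, DialogueManager), the stub transcription and intent, and the ANSWERS table are illustrative assumptions, not part of the patent:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    class AsrModel:
        """Stands in for the first preset ASR model (speech -> text)."""
        def transcribe(self, audio: bytes) -> str:
            return "check my credit card limit"   # placeholder transcription

    class NluModel:
        """Stands in for the first preset NLU model (text -> semantics)."""
        def understand(self, text: str) -> str:
            return "query_credit_limit"           # placeholder user semantics

    ANSWERS = {"query_credit_limit": "Your current limit is 50,000."}  # preset database

    @dataclass
    class DialogueManager:
        asr: AsrModel
        nlu: NluModel
        corpus: List[Tuple[str, str]] = field(default_factory=list)  # dialogue voice corpus

        def handle_turn(self, audio: bytes) -> str:
            text = self.asr.transcribe(audio)          # recognition result
            semantics = self.nlu.understand(text)      # user semantics
            answer = ANSWERS.get(semantics, "transferring you to a human agent")
            self.corpus.append((text, answer))         # accumulate the dialogue
            return answer

Each completed dialogue leaves its turns in corpus, which is exactly the material forwarded for labeling in the next step.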
S102, the dialogue voice corpus is sent to a corpus labeling system through a first preset message middleware, so that the corpus labeling system labels the dialogue voice corpus through a preset voice corpus labeling tool to obtain a labeled voice corpus.
Labeling supplies the speech recognition model with speech corpora in the form it requires for recognition and natural language processing. Corpora may be labeled in an ASR manner, for example with tools such as Praat or Transcriber, and in an NLU manner, for example with corpus labeling tools such as Brat, Prodigy or YEDDA.
Specifically, after the dialogue management system interacts with a user to obtain a dialogue voice corpus, the corpus is sent to a corpus labeling system through a first preset message middleware. The corpus labeling system labels the dialogue voice corpus with a preset voice corpus labeling tool to obtain a labeled voice corpus, which is sent to a file server. For example, please refer to fig. 2, a schematic diagram of a specific embodiment of the dialog management processing method based on speech recognition according to the embodiment of the present application. As shown in fig. 2, after a round of human-computer interaction, the dialog management system sends the dialogue text, the AsrSessionId (the ASR dialogue ID), the interaction time point, the interaction result (e.g. success, failure, or transfer to a human agent) and auxiliary information (non-sensitive information such as the transacted business process name, telephone area code, caller gender, etc.) to the corpus tagging system through an MQ (Message Queue, also called message middleware). The corpus tagging system tags according to the dialogue, currently producing separate corpora for the ASR language model and for the NLU model. In many prior corpus labeling setups, the ASR product and the dialogue management system are separated and do not cooperate as an integrated whole; in the embodiment of the application, the MQ decouples their operation while integrating them into a cooperating whole.
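As a sketch of the hand-off through the first preset message middleware, the following assumes a RabbitMQ broker reached through the pika client; the queue name, payload fields and values are illustrative only:

    import json
    import pika  # RabbitMQ client; any message middleware (MQ) would serve

    payload = {
        "asr_session_id": "20200616-000123-robot07",   # hypothetical AsrSessionId
        "dialogue_text": ["User: activate my card", "Bot: which card?"],
        "interaction_time": "2020-06-16T10:32:00",
        "interaction_result": "success",               # success / failure / transfer
        "aux_info": {"process_name": "credit_card", "area_code": "0755"},
    }

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="dialogue_corpus", durable=True)
    channel.basic_publish(exchange="", routing_key="dialogue_corpus",
                          body=json.dumps(payload))
    conn.close()

The corpus tagging system consumes from the same queue, so neither side blocks on the other, which is the decoupling-with-integration described above.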
Further, operations performed on the labeled voice corpus are received, including revision and confirmation. Labeling can be semi-automatic: the user's recording is first transcribed into text by an ASR engine and pre-labeled, and an annotator then checks whether each result meets requirements. Accurate results need no action, while erroneous results are corrected into the right text. Only after the annotator confirms a result is the confirmed corpus used to train the speech recognition model, which guarantees labeling accuracy and improves the accuracy of speech recognition.
S103, sending the marked voice corpus to a voice recognition model training system through a second preset message middleware, so that the voice recognition model training system trains a second preset voice recognition model by using the marked voice corpus.
The second preset speech recognition model and the first preset speech recognition model may be the same or different, and may be constructed based on the same speech recognition model or based on different speech recognition models.
Specifically, the corpus tagging system pushes the tagged corpora to a file server, and the model training system then acquires them from the file server for training, covering both the ASR model and the NLU model. For ASR, a neural network language model is used; because the ASR model can refer to the context, that is, the machine's question, it expresses the semantics more accurately. For example, a near-homophone mis-transcription of "credit card activation" is quickly corrected by the neural network language model. Language models are also strongly domain-specific, so different services use different language models. Since the voice corpora of the application come directly and in real time from the dialogue management system, the speech recognition model is trained on real corpora generated in the real business context, which can greatly improve its recognition rate in that business scene: the trained model matches the real business scene consistently, and its training is targeted at that scene. The NLU model can be trained at the same time; its training incorporates context, which suits the strong coupling between the language model used in speech recognition and the business domain. Language models can thus be trained per business scene, improving the accuracy and efficiency of speech recognition and, in particular, the completion rate of autonomous service.
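A minimal sketch of the training hand-off, assuming the labeled corpora sit on the file server as JSON-lines files and that SecondAsrModel and SecondNluModel stand in for whatever toolkit actually trains the models (all names and paths are illustrative):

    import json
    from pathlib import Path

    def load_labeled_corpus(path: Path) -> list:
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f]

    class SecondAsrModel:
        def train(self, corpus: list) -> None: ...  # acoustic + neural-network language model

    class SecondNluModel:
        def train(self, corpus: list) -> None: ...  # context-aware semantic understanding

    asr_corpus = load_labeled_corpus(Path("/srv/corpora/asr_labeled.jsonl"))
    nlu_corpus = load_labeled_corpus(Path("/srv/corpora/nlu_labeled.jsonl"))

    second_asr = SecondAsrModel()
    second_asr.train(asr_corpus)   # trained per business scene, on real corpora
    second_nlu = SecondNluModel()
    second_nlu.train(nlu_corpus)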
Referring to fig. 2, in the embodiment of the application the dialog management system receives user voice, recognizes it with a first voice recognition model (e.g. a first ASR model plus a first NLU model), and responds according to the recognition result, forming a dialogue voice corpus during the interaction. The corpus is sent to the corpus tagging system, which tags it to obtain a tagged voice corpus and sends the tagged corpus to a file server; the model training system obtains the tagged corpus from the file server and trains a second voice recognition model (e.g. a second ASR model plus a second NLU model). The dialog management system communicates with the corpus tagging system over MQ, as does the corpus tagging system with the model training system. Because each dialogue voice corpus generated by the dialog management system belongs to the scene of transacting one complete service (such as adjusting a credit card limit), the semantic contexts inside it are closely related, so the training corpus carries contextual relationships. The business goal is that every incoming call completes its transaction; since the corpora and the results generated during transaction are fed back promptly to the ASR system and the NLU model within NLP, the model training system can train both models in time on real, context-bearing corpora, promptly improving their speech recognition accuracy and adjusting them to the business scenes the dialog management system serves. The accuracy of voice recognition is thereby improved in time; applied to self-service, the embodiment raises the success rate of customers' business transactions and avoids wasting transaction resources.
In the traditional technology the dialogue management system, the corpus tagging system and the model training system are all split apart; data is imported and labeled manually, which is inefficient and lags in timeliness. In the embodiment of the application the three systems are decoupled through MQ, so they run without interfering with one another while still operating as an integrated whole, and the business model for each business scene therefore has corresponding real voice corpora. Even for a new business scene, once it runs in the production environment under dialogue management and real customers transact it, real-scene corpora are obtained and can be used to train the speech recognition model, ensuring the model's reliability for that scene. This avoids the traditional situation in which no corpora exist for new scenes when training a speech recognition model and no feedback loop is formed, which is why traditional systems can only serve as simple question-and-answer robots.
And S104, judging whether the trained second preset voice recognition model meets a preset dialogue completion rate condition, wherein the dialogue completion rate is the proportion of the number of dialogues completed based on voice recognition within a preset time period to the total number of dialogues in that period.
And S105, if the trained second preset speech recognition model meets the preset dialogue completion rate condition, replacing the first preset speech recognition model with it so that the dialogue management system calls it to complete new dialogues; if it does not meet the condition, continuing to train the second preset speech recognition model on the new dialogue voice corpora generated in step S101 until it does.
The dialogue completion rate is the proportion of dialogues completed based on voice recognition within a preset time period to the total number of dialogues in that period. The preset dialogue completion rate condition asks whether the proportion of dialogues the second preset speech recognition model completes by itself within the period meets expectations, for example whether it is greater than or equal to a preset proportion value, or greater than the proportion the currently used speech recognition model completed by itself over a period of the same length.
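The condition can be stated precisely in a few lines; the record layout and the threshold value below are hypothetical examples, not figures from the patent:

    def completion_rate(dialogues: list) -> float:
        """Share of dialogues in the period completed by speech recognition
        alone (interaction_result == 1, i.e. no transfer to a human agent)."""
        done = sum(1 for d in dialogues if d["interaction_result"] == 1)
        return done / len(dialogues) if dialogues else 0.0

    PRESET_PROPORTION = 0.85   # hypothetical preset proportion value

    def meets_condition(dialogues: list) -> bool:
        return completion_rate(dialogues) >= PRESET_PROPORTION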
Specifically, after the second preset speech recognition model has been trained for a preset period, say a month or half a year, on voice corpora generated by the dialogue management system from real scenes of users' transacted business, whether it meets the preset dialogue completion rate condition is judged. If it does, it replaces the first preset speech recognition model: when the dialogue management system receives the voice of a user transacting business, it calls the second preset speech recognition model to recognize that voice, i.e. the dialogue management system calls the trained second preset speech recognition model to complete the new dialogue. Because the trained second model is produced in real time from corpora generated in the real scenes of users' transacted business, it fits the actual needs of real transactions better and suits the real transaction scene more closely, improving the accuracy of speech recognition during transactions, the dialogue quality, and the dialogue completion rate.
In summary, user voice is received through the dialogue management system and recognized with the first preset voice recognition model; the response to the user voice forms an interactive dialogue voice corpus, which is sent through message middleware to the corpus tagging system for tagging. The model training system trains the second preset voice recognition model on the tagged corpus, and if the trained model satisfies the preset dialogue completion rate condition it replaces the first preset voice recognition model and is called by the dialogue management system to complete new dialogues. Because the dialogue management system, the corpus tagging system and the speech recognition model training system are coupled, real voice corpora reach the tagging system in time and the training system can train the second preset speech recognition model on tagged real corpora in real time. Compared with the traditional split arrangement, the speech model is trained promptly on the real corpora of each service scene, improving recognition accuracy during training, the accuracy of speech recognition in dialogue management, and the dialogue completion rate, and in particular the self-service completion rate of the intelligent customer service robot.
In one embodiment, the dialogue voice corpus includes a plurality of voice corpora formed by a plurality of dialogues, each dialogue's corpus carrying the interaction result for that dialogue, where the interaction result may be a transfer to a human agent. The step of labeling the dialogue voice corpus with a preset voice corpus labeling tool to obtain a labeled voice corpus then includes:
identifying the voice corpora whose interaction result is a transfer to a human agent;
removing those corpora from the dialogue voice corpus to obtain a screened dialogue voice corpus;
and labeling the screened dialogue voice corpus with the preset voice corpus labeling tool to obtain the labeled voice corpus.
Specifically, a transfer to a human agent generally happens because the computer device providing self-service cannot handle the problem. Every self-service session that transfers to a human agent is therefore a service the computer device could not complete, indicating a business scene the speech recognition model's training did not cover: either the scene's semantics were misunderstood or the scene is unsupported. Voice corpora from uncompleted self-service sessions are thus not used directly as training corpora; only corpora whose interaction result is a successful interaction are used to train the speech recognition model, which further improves training efficiency and accuracy. Business scenes whose interaction result is a transfer to a human agent are left for subsequent service personnel to verify and investigate manually before retraining the speech recognition model. Whether an interaction was transferred can be judged by a field value: for example, if the interaction was transferred, the corresponding field R is assigned 0; if it was not transferred and the self-service interaction with the customer succeeded, R is assigned 1.
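The screening rule maps directly onto the field-R convention above; a short sketch follows, with the record layout assumed for illustration:

    def screen_corpus(dialogue_corpora: list) -> list:
        """Keep only corpora whose interaction succeeded (R == 1);
        drop those transferred to a human agent (R == 0)."""
        return [c for c in dialogue_corpora if c.get("R") == 1]

    corpus = [
        {"R": 1, "text": "self-service completed"},  # usable for training
        {"R": 0, "text": "transferred to agent"},    # removed before labeling
    ]
    screened = screen_corpus(corpus)   # only the R == 1 record remains

The R == 0 records are not discarded outright; they are routed to service personnel for manual review, as described above.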
Referring to fig. 3, fig. 3 is a schematic sub-flow diagram of a dialog management processing method based on speech recognition according to an embodiment of the present application, and as shown in fig. 3, in this embodiment, the dialog management system invokes a first preset speech recognition model to recognize the user speech to obtain a recognition result, and responds to the user speech according to the recognition result to complete a dialog, where the step of forming the dialog into a dialog speech corpus includes:
s301, receiving a first voice corresponding to the user voice, generating a preset dialogue coding identifier of a dialogue corresponding to the user voice, wherein the preset dialogue coding identifier can be a dialogue serial number and comprises machine equipment elements, time elements and user elements related to the dialogue, such as the date and time of the dialogue, a dialogue sequence number, an accessed self-service machine number and the like, and generating a character string containing the machine equipment elements, the time elements and the user elements related to the dialogue, such as the date and time of the dialogue, the dialogue sequence number, the accessed self-service machine number and the like, according to the first preset sequence to generate the preset dialogue coding identifier of the dialogue corresponding to the user voice.
S302, calling a first preset ASR model according to the preset dialogue coding identifier, converting the first voice into user text through the first preset ASR model, and generating an ASR dialogue coding identifier for this call based on the preset dialogue coding identifier. The ASR dialogue coding identifier is a serial number for the ASR model call, comprising the preset dialogue coding identifier, the call's date and time, and a counter of how many calls have been made; a character string containing these elements is generated in a second preset order to form the ASR dialogue coding identifier.
S303, calling a first preset NLU model to understand the user characters to obtain user semantics.
S304, according to the user semantics, screening a preset answer corresponding to the user semantics from a preset database by a preset semantic matching method. Semantic matching includes exact matching and fuzzy matching: exact matching requires a preset answer in the database whose semantics are identical to the semantics recognized in the user voice, while fuzzy matching accepts a preset answer whose semantics are the same as or similar to those recognized.
S305, converting the preset answer into a response voice, and sending the response voice to a user to respond to the first voice.
S306, judging whether the user voice is finished.
S307, if the user voice is not finished, receiving a second voice corresponding to the user voice, iteratively executing the step of calling the first preset ASR model according to the preset dialogue coding identifier until the user voice is finished to finish the dialogue, and entering the step S308.
And S308, finishing the conversation if the voice of the user is finished.
S309, forming a dialogue voice corpus by the user voice and the preset answer, wherein the dialogue voice corpus comprises a preset dialogue coding identifier and the ASR dialogue coding identifier.
Automatic Speech Recognition (ASR) can be divided into a "traditional" recognition mode and an "end-to-end" recognition mode; the main difference lies in the acoustic model, which in the traditional mode is generally a hidden Markov model (HMM) and in the end-to-end mode a deep neural network (DNN).
Specifically, a dialogue in which a user transacts business requires continuous interaction between the user and the self-service voice service, for example in a question-and-answer mode. When the dialogue management system starts receiving the user voice, it receives the user's first voice and generates the preset dialogue coding identifier used to track the conversation of the service. According to that identifier, a first preset ASR model is called to convert the first voice into first user text, and a first ASR dialogue coding identifier is generated to describe this ASR call. A first preset NLU model is then called to understand the first user text and obtain the first user semantics; a first preset answer matching those semantics is screened from the preset database by the preset semantic matching method and converted into first response voice to answer the first voice. Whether the user voice is finished is then judged. If not, a second voice is received, the first preset ASR model is called again to convert it into second user text, a second ASR dialogue coding identifier is generated for this call, the first preset NLU model is called again to obtain the second user semantics, a second preset answer is screened and converted into second response voice, and the end-of-voice judgment is repeated for a third voice and so on. The step of calling the first preset ASR model according to the preset dialogue coding identifier is iterated in this way until the user voice ends and the dialogue is completed. The several voices of the conversation and the preset answers corresponding to each voice together form the dialogue voice corpus, which includes the preset dialogue coding identifier and the ASR dialogue coding identifiers.
For example, when a user dials a voice call to transact self-service business, the dialog management system generates a UniqueID (an ID marking one complete dialogue) and records each user utterance and each system response. Every time the ASR model is invoked to convert an utterance, an AsrSessionId is generated to mark that ASR interaction; the ASR system sends the transcription to the dialog management system, which calls the NLU model to interpret the text and, according to the interpretation, selects the corresponding preset response from the database. The user thus interacts with the self-service voice computer equipment, forming a dialogue voice corpus as the interaction proceeds. The ASR and NLU models are central to this whole process; setting identifiers for the dialogue as a whole and for each ASR call lets the context of one dialogue be linked into a complete interactive record, so that the subsequent ASR and NLU models can learn from context, improving the accuracy of speech recognition model training.
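The two identifiers can be sketched as plain string builders; the element order and formats below are illustrative, since the patent fixes only that a first and a second preset order exist:

    import itertools
    from datetime import datetime

    def make_unique_id(machine_no: str, seq: int) -> str:
        # time element + dialogue sequence number + self-service machine number
        return f"{datetime.now():%Y%m%d%H%M%S}-{seq:06d}-{machine_no}"

    def make_asr_session_id(unique_id: str, call_no: int) -> str:
        # preset dialogue coding identifier + call time + call counter
        return f"{unique_id}-{datetime.now():%H%M%S}-{call_no:03d}"

    unique_id = make_unique_id("robot07", 123)     # one UniqueID per dialogue
    for call_no in itertools.count(1):             # one AsrSessionId per ASR call
        asr_session_id = make_asr_session_id(unique_id, call_no)
        # ... transcribe, understand, answer ...
        if call_no >= 3:                           # placeholder end-of-voice test
            break

Because every AsrSessionId embeds the UniqueID, all ASR calls of one dialogue can later be reassembled in order, which is what gives the training corpora their context.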
In one embodiment, the corpus tagging system tags the dialogue speech corpus by a preset speech corpus tagging tool to obtain tagged speech corpus, including:
labeling the dialogue voice corpus by using a preset ASR labeling mode to obtain an ASR labeled voice corpus;
labeling the dialogue voice corpus by using a preset NLU labeling mode to obtain an NLU labeled voice corpus;
the step of the speech recognition model training system using the labeled speech corpus to train a second preset speech recognition model comprises:
a speech recognition model training system acquires the ASR labeled speech corpus and the NLU labeled speech corpus;
and training a second preset ASR model by using the ASR labeled speech corpus, and training a second preset NLU model by using the NLU labeled speech corpus.
The ASR model here is an acoustic model that integrates knowledge of acoustics and pronunciation to convert voice into text. Because a user's voice is continuous, the computer device does not know which phoneme or word corresponds to which part of the audio, so the dialogue voice corpus must first be labeled in an ASR manner; the labels let the voice be automatically segmented into phonemes or words, which are then converted into text through the ASR model. ASR labeling, i.e. labeling the dialogue corpus at the speech level, can be done with speech labeling tools such as Praat or Transcriber.
The NLU model here is a language model, which learns the interrelations between words from the training corpus in order to estimate the likelihood of a hypothesized word sequence (also called the language model score); it reflects how words form phrases and phrases form sentences, expressing the relationships within the language text, and usually allows more accurate estimation of the language. Language model toolkits include SRILM, IRSTLM, MITLM, BerkeleyLM, etc. For the NLU model to turn text into words and sentences with their commonly described meanings, the text must be labeled, so that the labeled text forms meaningful words and sentences through the NLU model. Hence, after the voice is converted into text, the text is labeled in the preset NLU manner to obtain the NLU-labeled voice corpus, which the NLU model converts into meaningful words and sentences, so that the voice finally becomes ordinary written language. NLU labeling of the text contained in the ASR-labeled corpus can be done with corpus labeling tools such as Brat, Parker, YEDDA, Snorkel, Prodigy, etc.
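For concreteness, one possible shape for the two labeled records follows; the field names are assumptions for illustration, not a format the patent prescribes:

    asr_record = {                                   # ASR-labeled corpus entry
        "asr_session_id": "20200616103200-000123-robot07-103201-001",
        "audio_path": "/srv/audio/000123_001.wav",
        "transcript": "I want to activate my credit card",  # annotator-confirmed text
    }

    nlu_record = {                                   # NLU-labeled corpus entry
        "unique_id": "20200616103200-000123-robot07",
        "text": "I want to activate my credit card",
        "intent": "activate_credit_card",            # semantic label
        "context": ["Bot: How can I help you?"],     # preceding turn, kept for context
    }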
Specifically, in the present application user voice is received through the dialog management system, recognized with the first ASR model and the first NLU model to obtain a recognition result, and responded to according to that result, forming a dialogue voice corpus. The corpus is sent to the corpus tagging system, which labels it in the preset ASR manner to obtain an ASR-labeled voice corpus and in the preset NLU manner to obtain an NLU-labeled voice corpus, then sends both to the file server. The model training system obtains them from the file server and trains the second ASR model on the ASR-labeled corpus and the second NLU model on the NLU-labeled corpus, so that both models are trained on real corpora in real time. Speech recognition accuracy depends heavily on the corpora the language model is trained on; training each service's ASR model on the real voice corpora its own service scene generates raises the ASR model's recognition accuracy, which in turn promotes the NLU model's understanding accuracy, ultimately improving overall speech recognition accuracy and the self-service completion rate of the intelligent customer service robot.
In an embodiment, before the step of determining whether the trained second preset speech recognition model satisfies the preset dialog completion rate condition, the method further includes:
the dialogue management system calls the second preset ASR model and the second preset NLU model to recognize the received new user voice and responds to the new user voice to complete the dialogue;
counting a first completion rate of a dialog completed by recognizing the user voice through the first preset ASR model and the first preset NLU model within a preset time period;
counting a second completion rate of the conversation completed by recognizing the new user voice by the second preset ASR model and the second preset NLU model in the preset time period;
the step of judging whether the trained second preset speech recognition model meets the preset conversation completion rate condition comprises the following steps:
judging whether the second completion rate is greater than the first completion rate;
and if the second completion rate is greater than the first completion rate, judging that the trained second preset speech recognition model meets a preset conversation completion rate condition.
Specifically, for the first preset ASR model and first preset NLU model on the one hand, and the second preset ASR model and second preset NLU model on the other, completion rates are counted over time periods of the same length; whether the second completion rate exceeds the first is then judged. If it does, the trained second preset speech recognition model is judged to meet the preset dialogue completion rate condition, the second preset ASR model replaces the first preset ASR model, and the second preset NLU model replaces the first preset NLU model. For the self-service voice services the dialogue management system provides in a preset time period, the completion rate of each service is counted: a dialogue counts as successful if its interaction result is success without a transfer to a human agent. If adopting the trained second preset ASR and NLU models improves the self-service completion rate, the new models corresponding to them are adopted; otherwise the old models corresponding to the first preset ASR and NLU models remain in use and the second preset ASR and NLU models continue training. For example, the dialog management system may count the completion rate of each self-service voice service every month: if the rate improves with the trained second preset speech recognition model, that model is adopted; otherwise the first preset speech recognition model continues completing users' dialogues while the second preset speech recognition model keeps training.
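The monthly comparison reduces to an A/B check over two periods of equal length; the sample records below are illustrative:

    def completion_rate(dialogues: list) -> float:
        done = sum(1 for d in dialogues if d["interaction_result"] == 1)
        return done / len(dialogues) if dialogues else 0.0

    # dialogues completed under each model over periods of equal length
    first_model_dialogues = [{"interaction_result": 1}, {"interaction_result": 0}]
    second_model_dialogues = [{"interaction_result": 1}, {"interaction_result": 1}]

    first_rate = completion_rate(first_model_dialogues)     # 0.5
    second_rate = completion_rate(second_model_dialogues)   # 1.0

    if second_rate > first_rate:
        print("adopt the trained second preset ASR/NLU models")
    else:
        print("keep the first preset models; continue training the second")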
Because speech recognition accuracy depends heavily on the language model it uses, and language model training depends directly on the speech corpora used, the embodiment of the application uses the real corpora of each service to train distinct ASR neural network language models, improving speech recognition accuracy and in turn the NLU model's understanding accuracy. A service-specific corpus labeling system and model training system can thus be built. This is especially important for customer service robots, where service-specific labeling and training markedly improve the accuracy and efficiency of speech recognition, so that accurate recognition is achieved across the various types of professional services and the self-service completion rate of the intelligent customer service robot is ultimately improved.
It should be noted that the technical features of the dialog management processing methods based on speech recognition described in the foregoing embodiments may be recombined as needed to obtain new combined implementations, all of which remain within the protection scope of the present application.
Referring to fig. 4, fig. 4 is a schematic block diagram of a dialog management processing device based on speech recognition according to an embodiment of the present application. Corresponding to the above dialog management processing method based on voice recognition, the embodiment of the present application further provides a dialog management processing apparatus based on voice recognition. As shown in fig. 4, the dialog management processing device based on speech recognition includes a unit for executing the above-mentioned dialog management processing method based on speech recognition, and the dialog management processing device based on speech recognition can be configured in a computer device. Specifically, referring to fig. 4, the dialog management processing apparatus 400 based on speech recognition includes a dialog unit 401, a labeling unit 402, a training unit 403, a determining unit 404, and a replacing unit 405.
The dialogue unit 401 is configured to receive a user voice through a dialogue management system, so that the dialogue management system invokes a first preset voice recognition model to recognize the user voice to obtain a recognition result, and responds to the user voice according to the recognition result to complete a dialogue, so that the dialogue forms a dialogue voice corpus;
a labeling unit 402, configured to send the conversational speech corpus to a corpus labeling system through a first preset message middleware, so that the corpus labeling system labels the conversational speech corpus through a preset speech corpus labeling tool to obtain a labeled speech corpus;
a training unit 403, configured to send the labeled speech corpus to a speech recognition model training system through a second preset message middleware, so that the speech recognition model training system trains a second preset speech recognition model by using the labeled speech corpus;
a determining unit 404, configured to determine whether the trained second preset speech recognition model meets a preset conversation completion rate condition, where the conversation completion rate is a ratio of the number of conversations completed based on speech recognition in the preset time period to all the number of conversations in the preset time period;
a replacing unit 405, configured to replace the first preset speech recognition model with the trained second preset speech recognition model to be called by the dialog management system to complete a new dialog if the trained second preset speech recognition model meets the preset dialog completion rate condition.
In an embodiment, the dialogue speech corpus includes a plurality of speech corpora respectively formed by a plurality of dialogues, the speech corpus formed by each dialogue includes an interaction result corresponding to the dialogue, the possible interaction results include a transfer to a human agent, and the labeling unit 402 includes:
the recognition subunit is used for identifying the speech corpora whose interaction result is a transfer to a human agent;
the eliminating subunit is used for removing from the dialogue speech corpora those whose interaction result is a transfer to a human agent, so as to obtain screened dialogue speech corpora (a sketch of this screening follows the list);
and the labeling subunit is used for labeling the screened dialogue speech corpora through a preset speech corpus labeling tool so as to obtain labeled speech corpora.
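A minimal sketch of the screening performed by the recognition and eliminating subunits is given below, assuming a hypothetical record layout in which each speech corpus carries its interaction result as a dictionary field; the field and marker names are assumptions, not part of the embodiments.

    TRANSFER_TO_HUMAN = "transfer_to_human"   # hypothetical marker value

    def screen_corpora(dialogue_corpora):
        """Remove corpora whose interaction result is a transfer to a human agent."""
        return [corpus for corpus in dialogue_corpora
                if corpus.get("interaction_result") != TRANSFER_TO_HUMAN]

    corpora = [
        {"audio": "dialogue_a.wav", "interaction_result": "self_service_completed"},
        {"audio": "dialogue_b.wav", "interaction_result": TRANSFER_TO_HUMAN},
    ]
    screened = screen_corpora(corpora)   # keeps only dialogue_a.wav

Only the corpora that the robot completed by itself are then labeled, so that transfers to human agents do not contaminate the training data.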
In one embodiment, the dialog unit 401 includes:
the first receiving subunit is used for receiving a first voice corresponding to the user voice and generating a preset dialogue coding identifier of a dialogue corresponding to the user voice;
the first calling subunit is used for calling a first preset ASR model according to the preset dialogue coding identifier, converting the first voice into user text through the first preset ASR model, and generating an ASR dialogue coding identifier corresponding to the call based on the preset dialogue coding identifier;
the second calling subunit is used for calling the first preset NLU model to understand the user text so as to obtain user semantics;
the screening subunit is used for screening out a preset answer corresponding to the user semantics from a preset database in a preset semantic matching mode according to the user semantics;
the response subunit is used for converting the preset answers into response voice and sending the response voice to a user to respond to the first voice;
the first judging subunit is used for judging whether the user voice is finished or not;
the second receiving subunit is used for receiving a second voice corresponding to the user voice if the user voice is not finished, and iteratively executing the step of calling the first preset ASR model according to the preset dialogue coding identifier until the user voice is finished so as to finish the dialogue;
and the dialogue forming subunit is used for completing the dialogue if the user voice is finished, and forming a dialogue voice corpus from the user voice and the preset answers, wherein the dialogue voice corpus includes the preset dialogue coding identifier and the ASR dialogue coding identifier (a sketch of this iterative flow follows the list).
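The iterative flow carried out by these subunits can be pictured with a brief Python sketch. The interfaces asr.transcribe, nlu.understand, db.match_answer, and tts.synthesize, as well as the use of a UUID as the preset dialogue coding identifier, are illustrative assumptions rather than part of the embodiments.

    import uuid

    def run_dialogue(asr, nlu, db, tts, receive_voice, send_voice):
        dialogue_id = str(uuid.uuid4())       # preset dialogue coding identifier
        turns = []
        voice = receive_voice()               # first voice of the user
        while voice is not None:              # None signals that the user voice ended
            asr_call_id = f"{dialogue_id}-{len(turns)}"   # ASR dialogue coding identifier
            text = asr.transcribe(voice)      # first preset ASR model
            semantics = nlu.understand(text)  # first preset NLU model
            answer = db.match_answer(semantics)   # preset semantic matching
            send_voice(tts.synthesize(answer))    # respond to the user's voice
            turns.append({"asr_call_id": asr_call_id,
                          "user_text": text, "answer": answer})
            voice = receive_voice()           # second and later voices, if any
        # The finished dialogue forms a dialogue voice corpus carrying both identifiers.
        return {"dialogue_id": dialogue_id, "turns": turns}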
In one embodiment, the labeling unit 402 includes:
the first labeling subunit is used for labeling the dialogue speech corpus by using a preset ASR labeling mode to obtain an ASR labeled speech corpus;
the second labeling subunit is used for labeling the dialogue voice corpus by using a preset NLU labeling mode to obtain an NLU labeled voice corpus;
the training unit 403 includes:
an obtaining subunit, configured to obtain, by a speech recognition model training system, the ASR tagged speech corpus and the NLU tagged speech corpus;
and the training subunit is used for training the second preset ASR model by using the ASR labeling speech corpus and training the second preset NLU model by using the NLU labeling speech corpus.
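A minimal sketch of this two-model dispatch, assuming hypothetical trainer objects with a fit method, might look as follows; the names are illustrative only.

    def train_models(asr_trainer, nlu_trainer, asr_labeled_corpus, nlu_labeled_corpus):
        # The ASR-labeled corpus (e.g. audio/transcript pairs) trains the
        # second preset ASR model; the NLU-labeled corpus (e.g. text with
        # intent labels) trains the second preset NLU model.
        second_asr_model = asr_trainer.fit(asr_labeled_corpus)
        second_nlu_model = nlu_trainer.fit(nlu_labeled_corpus)
        return second_asr_model, second_nlu_model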
In one embodiment, the speech recognition based dialog management processing apparatus 400 further comprises:
the call unit is used for the dialogue management system to call the second preset ASR model and the second preset NLU model to recognize the received new user voice and respond to the new user voice to complete the dialogue;
the first statistical unit is used for counting a first completion rate of a conversation completed by recognizing the user voice through the first preset ASR model and the first preset NLU model in a preset time period;
the second statistical unit is used for counting a second completion rate of the conversation completed by recognizing the new user voice by the second preset ASR model and the second preset NLU model in the preset time period;
the judging unit 404 includes:
a second determining subunit, configured to determine whether the second completion rate is greater than the first completion rate;
and the judging subunit is used for judging that the trained second preset speech recognition model meets a preset conversation completion rate condition if the second completion rate is greater than the first completion rate.
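The judgment can be illustrated with a short sketch; the record layout and field names below are assumptions, not part of the embodiments. Each record marks which model pair handled a dialogue within the preset time period and whether it was completed by speech recognition alone.

    def completion_rate(records, model_tag):
        """Share of dialogues handled by the given model pair that were
        completed by speech recognition alone within the time period."""
        handled = [r for r in records if r["models"] == model_tag]
        if not handled:
            return 0.0
        completed = sum(1 for r in handled if r["completed_by_recognition"])
        return completed / len(handled)

    def meets_condition(records):
        first_rate = completion_rate(records, "first_preset_models")
        second_rate = completion_rate(records, "second_preset_models")
        return second_rate > first_rate   # the preset completion rate condition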
It should be noted that, as can be clearly understood by those skilled in the art, for the specific implementation process of the dialog management processing apparatus based on speech recognition and each unit, reference may be made to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, no further description is provided here.
Meanwhile, the division and connection manner of the units in the dialog management processing device based on speech recognition are only used for illustration. In other embodiments, the device may be divided into different units as required, or the units may adopt a different connection order and manner to complete all or part of the functions of the dialog management processing device based on speech recognition.
The above-described dialog management processing apparatus based on speech recognition may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a computer device such as a desktop computer or a server, or may be a component or part of another device.
Referring to fig. 5, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform a dialog management processing method based on speech recognition as described above.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be enabled to execute a dialog management processing method based on voice recognition.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration relevant to the present application and does not constitute a limitation on the computer device 500 to which the present application is applied; a particular computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in fig. 5 and are not described here again.
Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps: receiving a user voice through a dialogue management system, so that the dialogue management system calls a first preset voice recognition model to recognize the user voice to obtain a recognition result and responds to the user voice according to the recognition result to complete a dialogue, the dialogue forming a dialogue voice corpus; sending the dialogue voice corpus to a corpus tagging system through a first preset message middleware, so that the corpus tagging system tags the dialogue voice corpus through a preset voice corpus tagging tool to obtain a tagged voice corpus; sending the tagged voice corpus to a voice recognition model training system through a second preset message middleware, so that the voice recognition model training system trains a second preset voice recognition model by using the tagged voice corpus; judging whether the trained second preset voice recognition model meets a preset conversation completion rate condition, wherein the conversation completion rate is the ratio of the number of conversations completed based on voice recognition within a preset time period to the total number of conversations within the preset time period; and if the trained second preset voice recognition model meets the preset conversation completion rate condition, replacing the first preset voice recognition model with the trained second preset voice recognition model to be called by the dialogue management system to complete a new conversation.
In an embodiment, the dialogue voice corpus includes a plurality of voice corpora respectively formed by a plurality of dialogues, the voice corpus formed by each dialogue includes an interaction result corresponding to the dialogue, the possible interaction results include a transfer to a human agent, and the processor 502 specifically implements the following steps when implementing the step of labeling the dialogue voice corpus by using a preset voice corpus labeling tool to obtain a tagged voice corpus:
identifying the voice corpora whose interaction result is a transfer to a human agent;
removing from the dialogue voice corpora those whose interaction result is a transfer to a human agent, so as to obtain screened dialogue voice corpora;
and labeling the screened dialogue voice corpora through a preset voice corpus labeling tool to obtain tagged voice corpora.
In an embodiment, when implementing the steps of the dialog management system invoking a first preset speech recognition model to recognize the user speech to obtain a recognition result, responding to the user speech according to the recognition result to complete a dialog, and forming the dialog into a dialog speech corpus, the processor 502 specifically implements the following steps:
receiving a first voice corresponding to a user voice, and generating a preset dialogue coding identifier of a dialogue corresponding to the user voice;
calling a first preset ASR model according to the preset dialogue coding identifier so as to convert the first voice into user text through the first preset ASR model, and generating an ASR dialogue coding identifier corresponding to the call based on the preset dialogue coding identifier;
calling a first preset NLU model to understand the user text to obtain user semantics;
screening out a preset answer corresponding to the user semantics from a preset database in a preset semantic matching mode according to the user semantics;
converting the preset answers into response voices and sending the response voices to the user so as to respond to the first voice;
judging whether the user voice is finished or not;
if the user voice is not finished, receiving a second voice corresponding to the user voice, and iteratively executing the step of calling a first preset ASR model according to the preset dialogue coding identifier until the user voice is finished so as to finish the dialogue;
and if the user voice is finished, completing the dialogue, and forming a dialogue voice corpus by the user voice and the preset answer, wherein the dialogue voice corpus comprises the preset dialogue coding identifier and the ASR dialogue coding identifier.
In an embodiment, when implementing the step in which the corpus tagging system tags the dialogue voice corpus with a preset voice corpus tagging tool to obtain a tagged voice corpus, the processor 502 specifically implements the following steps:
labeling the dialogue voice corpus by using a preset ASR labeling mode to obtain an ASR labeled voice corpus;
labeling the dialogue voice corpus by using a preset NLU labeling mode to obtain an NLU labeled voice corpus;
when the processor 502 implements the step in which the speech recognition model training system uses the labeled speech corpus to train the second preset speech recognition model, the following steps are specifically implemented:
a speech recognition model training system acquires the ASR labeled speech corpus and the NLU labeled speech corpus;
and training a second preset ASR model by using the ASR labeled speech corpus, and training a second preset NLU model by using the NLU labeled speech corpus.
In an embodiment, before the step of determining whether the trained second preset speech recognition model meets the preset dialog completion rate condition, the processor 502 further performs the following steps:
the dialogue management system calls the second preset ASR model and the second preset NLU model to recognize the received new user voice and responds to the new user voice to complete a dialogue;
counting a first completion rate of a dialog completed by recognizing the user voice through the first preset ASR model and the first preset NLU model within a preset time period;
counting a second completion rate of the conversation completed by recognizing the new user voice by the second preset ASR model and the second preset NLU model in the preset time period;
when the processor 502 performs the step of determining whether the trained second preset speech recognition model meets the preset dialog completion rate condition, the following steps are specifically performed:
judging whether the second completion rate is greater than the first completion rate;
and if the second completion rate is greater than the first completion rate, judging that the trained second preset speech recognition model meets a preset conversation completion rate condition.
It should be understood that, in the embodiment of the present application, the processor 502 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
It will be understood by those skilled in the art that all or part of the processes in the method for implementing the above embodiments may be implemented by a computer program, and the computer program may be stored in a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium, and stores a computer program which, when executed by a processor, causes the processor to perform the steps of the dialog management processing method based on speech recognition described in the embodiments above.
The computer readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The storage medium is a physical, non-transitory storage medium, and may be any physical storage medium capable of storing a computer program, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two, and that the components and steps of the examples have been described above in general functional terms in order to illustrate clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes beyond the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the software product is stored in a storage medium and includes instructions for causing an electronic device (which may be a personal computer, a terminal, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present application.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A dialog management processing method based on speech recognition, the method comprising:
receiving a user voice through a dialogue management system, so that the dialogue management system calls a first preset voice recognition model to recognize the user voice to obtain a recognition result and responds to the user voice according to the recognition result to complete a dialogue, the dialogue forming a dialogue voice corpus;
sending the dialogue voice corpus to a corpus tagging system through a first preset message middleware, so that the corpus tagging system tags the dialogue voice corpus through a preset voice corpus tagging tool to obtain a tagged voice corpus;
sending the tagged voice corpus to a voice recognition model training system through a second preset message middleware, so that the voice recognition model training system trains a second preset voice recognition model by using the tagged voice corpus;
judging whether the trained second preset voice recognition model meets a preset conversation completion rate condition, wherein the conversation completion rate is the ratio of the number of conversations completed based on voice recognition within a preset time period to the total number of conversations within the preset time period;
and if the trained second preset voice recognition model meets the preset conversation completion rate condition, replacing the first preset voice recognition model with the trained second preset voice recognition model so as to be called by the conversation management system to complete a new conversation.
2. The speech recognition-based dialogue management processing method according to claim 1, wherein the dialogue speech corpus comprises a plurality of speech corpora respectively formed by a plurality of dialogues, the speech corpus formed by each dialogue comprises an interaction result corresponding to the dialogue, the possible interaction results include a transfer to a human agent, and the step of labeling the dialogue speech corpus by a preset speech corpus labeling tool to obtain a labeled speech corpus comprises:
identifying the speech corpora whose interaction result is a transfer to a human agent;
removing from the dialogue speech corpora those whose interaction result is a transfer to a human agent, so as to obtain screened dialogue speech corpora;
and labeling the screened dialogue speech corpora through a preset speech corpus labeling tool to obtain labeled speech corpora.
3. The method as claimed in claim 1 or 2, wherein the steps of the dialog management system invoking a first preset speech recognition model to recognize the user speech to obtain a recognition result, responding to the user speech according to the recognition result to complete a dialog, and forming the dialog into a dialog speech corpus comprise:
receiving a first voice corresponding to a user voice, and generating a preset dialogue coding identifier of a dialogue corresponding to the user voice;
calling a first preset ASR model according to the preset dialogue coding identifier so as to convert the first voice into user text through the first preset ASR model, and generating an ASR dialogue coding identifier corresponding to the call based on the preset dialogue coding identifier;
calling a first preset NLU model to understand the user text to obtain user semantics;
screening out a preset answer corresponding to the user semantics from a preset database in a preset semantic matching mode according to the user semantics;
converting the preset answers into response voices, and sending the response voices to the user to respond to the first voice;
judging whether the user voice is finished or not;
if the user voice is not finished, receiving a second voice corresponding to the user voice, and iteratively executing the step of calling a first preset ASR model according to the preset dialogue coding identifier until the user voice is finished so as to finish the dialogue;
and if the user voice is finished, completing the dialogue, and forming a dialogue voice corpus by the user voice and the preset answer, wherein the dialogue voice corpus comprises the preset dialogue coding identifier and the ASR dialogue coding identifier.
4. The speech recognition-based dialogue management processing method according to claim 3, wherein the corpus tagging system tags the dialogue speech corpus with a preset speech corpus tagging tool to obtain tagged speech corpus, comprising:
labeling the dialogue voice corpus by using a preset ASR labeling mode to obtain an ASR labeled voice corpus;
labeling the dialogue voice corpus by using a preset NLU labeling mode to obtain an NLU labeled voice corpus;
the step of the speech recognition model training system using the labeled speech corpus to train a second preset speech recognition model comprises:
a speech recognition model training system acquires the ASR labeled speech corpus and the NLU labeled speech corpus;
and training a second preset ASR model by using the ASR labeled speech corpus, and training a second preset NLU model by using the NLU labeled speech corpus.
5. The method of claim 4, wherein before the step of judging whether the trained second preset speech recognition model meets the preset conversation completion rate condition, the method further comprises:
the dialogue management system calls the second preset ASR model and the second preset NLU model to recognize the received new user voice and responds to the new user voice to complete a dialogue;
counting a first completion rate of a dialog completed by recognizing the user voice through the first preset ASR model and the first preset NLU model within a preset time period;
counting a second completion rate of the conversation completed by recognizing the new user voice by the second preset ASR model and the second preset NLU model in the preset time period;
the step of judging whether the trained second preset speech recognition model meets the preset conversation completion rate condition comprises the following steps:
judging whether the second completion rate is greater than the first completion rate;
and if the second completion rate is greater than the first completion rate, judging that the trained second preset speech recognition model meets a preset conversation completion rate condition.
6. A speech recognition-based dialog management processing apparatus, comprising:
the dialogue unit is used for receiving a user voice through a dialogue management system, so that the dialogue management system calls a first preset voice recognition model to recognize the user voice to obtain a recognition result and responds to the user voice according to the recognition result to complete a dialogue, the dialogue forming a dialogue voice corpus;
the labeling unit is used for sending the dialogue voice corpus to a corpus labeling system through a first preset message middleware so that the corpus labeling system labels the dialogue voice corpus through a preset voice corpus labeling tool to obtain a labeled voice corpus;
the training unit is used for sending the marked voice corpus to a voice recognition model training system through a second preset message middleware so that the voice recognition model training system trains a second preset voice recognition model by using the marked voice corpus;
the judging unit is used for judging whether the trained second preset voice recognition model meets a preset conversation completion rate condition, wherein the conversation completion rate is the ratio of the number of conversations completed based on voice recognition within a preset time period to the total number of conversations within the preset time period;
and the replacing unit is used for replacing the first preset voice recognition model by the trained second preset voice recognition model to be called by the dialogue management system to complete new dialogue if the trained second preset voice recognition model meets the preset dialogue completion rate condition.
7. The apparatus according to claim 6, wherein the dialogue speech corpus comprises a plurality of speech corpora respectively corresponding to a plurality of dialogues, the speech corpus formed in each dialogue comprises an interaction result corresponding to the dialogue, the possible interaction results include a transfer to a human agent, and the labeling unit comprises:
the recognition subunit is used for identifying the speech corpora whose interaction result is a transfer to a human agent;
the eliminating subunit is used for removing from the dialogue speech corpora those whose interaction result is a transfer to a human agent, so as to obtain screened dialogue speech corpora;
and the labeling subunit is used for labeling the screened dialogue speech corpora through a preset speech corpus labeling tool so as to obtain labeled speech corpora.
8. The speech recognition-based dialog management processing apparatus of claim 6 or 7, wherein the dialog unit comprises:
the first receiving subunit is used for receiving a first voice corresponding to the user voice and generating a preset dialogue coding identifier of a dialogue corresponding to the user voice;
the first calling subunit is used for calling a first preset ASR model according to the preset dialogue coding identifier, converting the first voice into user text through the first preset ASR model, and generating an ASR dialogue coding identifier corresponding to the call based on the preset dialogue coding identifier;
the second calling subunit is used for calling the first preset NLU model to understand the user text so as to obtain user semantics;
the screening subunit is used for screening out a preset answer corresponding to the user semantics from a preset database in a preset semantic matching mode according to the user semantics;
the response subunit is used for converting the preset answers into response voice and sending the response voice to a user to respond to the first voice;
the first judging subunit is used for judging whether the user voice is finished or not;
the second receiving subunit is used for receiving a second voice corresponding to the user voice if the user voice is not finished, and iteratively executing the step of calling the first preset ASR model according to the preset dialogue coding identifier until the user voice is finished so as to finish the dialogue;
and the dialogue forming subunit is used for finishing dialogue if the user voice is finished, and forming dialogue voice corpora by the user voice and the preset answer, wherein the dialogue voice corpora comprises the preset dialogue coding identifier and the ASR dialogue coding identifier.
9. A computer device, comprising a memory and a processor coupled to the memory; the memory is used for storing a computer program; the processor is adapted to run the computer program to perform the steps of the method according to any of claims 1-5.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when being executed by a processor, realizes the steps of the method according to any one of claims 1 to 5.
CN202010550379.9A 2020-06-16 2020-06-16 Dialogue management processing method, device, equipment and medium based on voice recognition Pending CN111739519A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010550379.9A CN111739519A (en) 2020-06-16 2020-06-16 Dialogue management processing method, device, equipment and medium based on voice recognition
PCT/CN2020/122422 WO2021135534A1 (en) 2020-06-16 2020-10-21 Speech recognition-based dialogue management method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010550379.9A CN111739519A (en) 2020-06-16 2020-06-16 Dialogue management processing method, device, equipment and medium based on voice recognition

Publications (1)

Publication Number Publication Date
CN111739519A true CN111739519A (en) 2020-10-02

Family

ID=72649914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010550379.9A Pending CN111739519A (en) 2020-06-16 2020-06-16 Dialogue management processing method, device, equipment and medium based on voice recognition

Country Status (2)

Country Link
CN (1) CN111739519A (en)
WO (1) WO2021135534A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11145291B2 (en) * 2018-01-31 2021-10-12 Microsoft Technology Licensing, Llc Training natural language system with generated dialogues
CN110120221A (en) * 2019-06-06 2019-08-13 上海蔚来汽车有限公司 The offline audio recognition method of user individual and its system for vehicle system
CN110377911B (en) * 2019-07-23 2023-07-21 中国工商银行股份有限公司 Method and device for identifying intention under dialog framework
CN110543552B (en) * 2019-09-06 2022-06-07 网易(杭州)网络有限公司 Conversation interaction method and device and electronic equipment
CN110765270B (en) * 2019-11-04 2022-07-01 思必驰科技股份有限公司 Training method and system of text classification model for spoken language interaction
CN111739519A (en) * 2020-06-16 2020-10-02 平安科技(深圳)有限公司 Dialogue management processing method, device, equipment and medium based on voice recognition

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103000052A (en) * 2011-09-16 2013-03-27 上海先先信息科技有限公司 Man-machine interactive spoken dialogue system and realizing method thereof
CN107945792A (en) * 2017-11-06 2018-04-20 百度在线网络技术(北京)有限公司 Method of speech processing and device
CN110059170A (en) * 2019-03-21 2019-07-26 北京邮电大学 More wheels based on user's interaction talk with on-line training method and system
CN110265001A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Corpus screening technique, device and computer equipment for speech recognition training
CN110263322A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Audio for speech recognition corpus screening technique, device and computer equipment
CN110503143A (en) * 2019-08-14 2019-11-26 平安科技(深圳)有限公司 Research on threshold selection, equipment, storage medium and device based on intention assessment
CN111143535A (en) * 2019-12-27 2020-05-12 北京百度网讯科技有限公司 Method and apparatus for generating a dialogue model

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021135534A1 (en) * 2020-06-16 2021-07-08 平安科技(深圳)有限公司 Speech recognition-based dialogue management method, apparatus, device and medium
CN112347768A (en) * 2020-10-12 2021-02-09 出门问问(苏州)信息科技有限公司 Entity identification method and device
CN112233665A (en) * 2020-10-16 2021-01-15 珠海格力电器股份有限公司 Model training method and device, electronic equipment and storage medium
CN112653798A (en) * 2020-12-22 2021-04-13 平安普惠企业管理有限公司 Intelligent customer service voice response method and device, computer equipment and storage medium
CN112837683A (en) * 2020-12-31 2021-05-25 苏州思必驰信息科技有限公司 Voice service method and device
CN113608664A (en) * 2021-07-26 2021-11-05 京东科技控股股份有限公司 Intelligent voice robot interaction effect optimization method and device and intelligent robot
CN114441029A (en) * 2022-01-20 2022-05-06 深圳壹账通科技服务有限公司 Recording noise detection method, device, equipment and medium of voice labeling system
CN116108373A (en) * 2023-04-17 2023-05-12 京东科技信息技术有限公司 Bill data classifying and labeling system, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021135534A1 (en) 2021-07-08

Similar Documents

Publication Publication Date Title
CN111739519A (en) Dialogue management processing method, device, equipment and medium based on voice recognition
CA2826116C (en) Natural language classification within an automated response system
US8914294B2 (en) System and method of providing an automated data-collection in spoken dialog systems
EP1602102B1 (en) Management of conversations
EP2282308B1 (en) Multi-slot dialog system and method
US7907705B1 (en) Speech to text for assisted form completion
US10789943B1 (en) Proxy for selective use of human and artificial intelligence in a natural language understanding system
CN111477216A (en) Training method and system for pronunciation understanding model of conversation robot
CN111696558A (en) Intelligent outbound method, device, computer equipment and storage medium
CN112131358A (en) Scene flow structure and intelligent customer service system applied by same
CN112084317A (en) Method and apparatus for pre-training a language model
CN107886940B (en) Voice translation processing method and device
US7853451B1 (en) System and method of exploiting human-human data for spoken language understanding systems
CN115022471B (en) Intelligent robot voice interaction system and method
CN110047473A (en) A kind of man-machine collaboration exchange method and system
CN113111157B (en) Question-answer processing method, device, computer equipment and storage medium
CN114328867A (en) Intelligent interruption method and device in man-machine conversation
CN112860873A (en) Intelligent response method, device and storage medium
KR20220140301A (en) Video learning systems for enable learners to be identified through artificial intelligence and method thereof
CN111985934A (en) Intelligent customer service dialogue model construction method and application
CN113539245B (en) Language model automatic training method and system
CN115602172A (en) Intelligent outbound method and system
CN113782022A (en) Communication method, device, equipment and storage medium based on intention recognition model
CN113887554A (en) Method and device for processing feedback words
CN115640386A (en) Method and apparatus for conducting dialogs based on recommended dialogs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination