CN115662431A - Voice robot communication method and device adopting high-generalization multitask intention recognition - Google Patents


Info

Publication number
CN115662431A
CN115662431A (application CN202211099250.6A)
Authority
CN
China
Prior art keywords
voice
intention
text
model
multitask
Prior art date
Legal status
Pending
Application number
CN202211099250.6A
Other languages
Chinese (zh)
Inventor
马达标
李蒙
Current Assignee
Beihai Qicheng Information Technology Co ltd
Original Assignee
Beihai Qicheng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihai Qicheng Information Technology Co ltd
Priority: CN202211099250.6A
Publication: CN115662431A
Legal status: Pending

Abstract

The application relates to a voice robot call method and device adopting highly generalized multitask intention recognition. The method comprises the following steps: the voice robot establishes a voice call with a user and acquires the user's real-time speech text data during the call; the real-time speech text data is input into a multitask intention recognition model to generate a replacement text, an intention category and a predicted text; the accuracy of the multitask intention recognition model is determined from the replacement text and the predicted text; when the accuracy is greater than a threshold, the dialogue script is updated based on the intention category; and the voice robot continues the voice call with the user according to the updated dialogue script. The method strengthens the generalization capability of the intention model, so that the user's intention can be analyzed more accurately and the voice call with the user flows more smoothly.

Description

Voice robot communication method and device adopting high-generalization multitask intention recognition
Technical Field
The application relates to the field of computer information processing, in particular to a voice robot call method and device adopting high-generalization multitask intention recognition, electronic equipment and a computer readable medium.
Background
The intelligent voice robot automatically initiates outbound robot call tasks according to a service scenario, based on technologies such as speech recognition and synthesis, machine learning, and natural language understanding; it collects service results through voice conversation between the user and the robot, performs statistical processing on the data, and acquires user feedback. The intelligent voice robot is a developer-facing conversational robot that can realize intelligent dialogue based on Natural Language Processing (NLP) on different message terminals, such as websites, apps, and physical robots. Users can configure their own knowledge base to realize intelligent question answering, and can also realize self-service through multi-round dialogue and integration with third-party APIs, for example: order inquiry, logistics tracking, self-service returns, and the like. The intelligent voice robot can analyze conversation content and, based on intelligent rules, mine possible problems and opportunities from the call recording or the call text. This can help enterprises improve service quality, monitor public-opinion risks, and optimize service strategies; typical application scenarios include intelligent customer-service quality inspection and sales-opportunity analysis.
Intention recognition is an important branch of natural language understanding and lies at the core of voice robot scenarios. At present, the most common method for intention recognition in the industry relies on a pre-trained text model: starting from an already trained pre-trained model (possibly downloaded, possibly trained in-house), (text, intention) data is used to fine-tune that model, yielding an intention recognition model. However, this method has a significant drawback: because the model parameters change during fine-tuning, part of the original information in the pre-trained model is destroyed. For this reason, intention recognition models trained in the prior art are not highly accurate at recognizing user intentions.
Therefore, there is a need for a new voice robot call method, apparatus, electronic device and computer-readable medium using highly generalized multitask intention recognition.
The above information disclosed in this background section is only for enhancement of understanding of the background of the application and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present application provides a voice robot call method, apparatus, electronic device and computer-readable medium using highly generalized multitask intention recognition, which embed a multitask training method into the training flow of the intention model and prevent the model obtained in the pre-training stage from losing too much pre-training information while the intention model is trained. The generalization capability of the intention model is thereby stronger, so the user's intention can be analyzed more accurately and the voice call with the user flows more smoothly.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of the present application, a voice robot call method using highly generalized multitask intention recognition is provided, the method comprising: the voice robot establishes a voice call with a user and acquires the user's real-time speech text data during the call; inputting the real-time speech text data into a multitask intention recognition model to generate a replacement text, an intention category and a predicted text; determining the accuracy of the multitask intention recognition model according to the replacement text and the predicted text; updating the dialogue script based on the intention category when the accuracy is greater than a threshold; and the voice robot continuing the voice call with the user according to the updated dialogue script.
Optionally, the method further comprises: acquiring historical speech text data; setting intention category labels for the historical speech text data; and training a multitask-improved BERT language model with the labeled historical speech text data to generate the multitask intention recognition model.
Optionally, training the multitask-improved BERT language model with historical speech text data bearing intention category labels to generate the multitask intention recognition model comprises: training the BERT language model with the labeled historical speech text data to generate a pre-trained model; replacing part of the characters in the historical speech text data to generate input text data; inputting the input text data into the multitask-improved pre-trained model to obtain a predicted text and a predicted intention classification; and generating the multitask intention recognition model when the loss functions corresponding to the predicted text and the predicted intention classification satisfy a preset strategy.
Optionally, training the BERT language model with the labeled historical speech text data to generate a pre-trained model includes: performing word-vector conversion on all characters in the historical speech text data to generate a word embedding tensor, a sentence segment tensor and a position encoding tensor for each character; generating a character vector for each character based on the word embedding tensor, the sentence segment tensor and the position encoding tensor; and training the BERT language model with the character vectors bearing intention category labels to generate the pre-trained model.
Optionally, performing a replacement operation on a part of characters in the historical speech text data to generate input text data, including: extracting characters with a preset proportion for replacement operation; storing partial characters before replacement operation; the input text data is generated by character vectors of the unsubstituted characters and the substituted characters.
Optionally, inputting the input text data into the pre-training model after the multi-task improvement to obtain a predicted text and a predicted intention classification, including: inputting the input text data to the pre-training model; the pre-training model performs intention prediction on the input text based on a bidirectional coding mechanism; the pre-training model identifies the replaced characters in the input text based on a multitasking mechanism; and the pre-training model generates a prediction text and prediction intention classification according to the calculation result.
Optionally, when the loss function corresponding to the predicted text and the predicted intent classification meets a preset policy, generating the multitask intent recognition model includes: comparing the similarity of the predicted text and the replaced characters to generate a first comparison result; comparing the predicted intention classification with the intention label to generate a second comparison result; and when the first comparison result is larger than a text threshold and the second comparison result is larger than an intention threshold, generating the multitask intention recognition model according to the current parameters of the pre-training model.
Optionally, the voice robot establishing a voice call with the user and acquiring the user's real-time speech text data during the call includes: determining a dialogue script according to the user information; the voice robot conducting the voice call with the user based on the dialogue script; and converting the user's real-time voice data into speech text data through speech recognition.
Optionally, determining an accuracy of the multitask intention recognition model according to the replacing text and the predicted text comprises: comparing the similarity of the replacement text and the predicted text; and determining the accuracy of the intention category according to the similarity comparison result.
Optionally, the method further comprises: when the accuracy is less than or equal to the threshold, the voice robot continues the voice call with the user according to the original dialogue script.
According to an aspect of the present application, a voice robot call device employing highly generalized multitask intention recognition is provided, the device comprising: a text module for the voice robot to establish a voice call with the user and acquire the user's real-time speech text data during the call; a recognition module for inputting the real-time speech text data into a multitask intention recognition model to generate a replacement text, an intention category and a predicted text; a judging module for determining the accuracy of the multitask intention recognition model according to the replacement text and the predicted text; an updating module for updating the dialogue script based on the intention category when the accuracy is greater than a threshold; and a call module for the voice robot to continue the voice call with the user according to the updated dialogue script.
According to an aspect of the present application, an electronic device is provided, the electronic device including: one or more processors; storage means for storing one or more programs; when executed by one or more processors, cause the one or more processors to implement a method as above.
According to an aspect of the application, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.
According to the voice robot call method, device, electronic equipment and computer-readable medium adopting highly generalized multitask intention recognition, a voice call is established between the voice robot and the user, and the user's real-time speech text data is acquired during the call; the real-time speech text data is input into a multitask intention recognition model to generate a replacement text, an intention category and a predicted text; the accuracy of the multitask intention recognition model is determined from the replacement text and the predicted text; when the accuracy is greater than a threshold, the dialogue script is updated based on the intention category; and the voice robot continues the voice call with the user according to the updated dialogue script. In this way the multitask training method is embedded into the training flow of the intention model, preventing the model obtained in the pre-training stage from losing too much pre-training information while the intention model is trained; the generalization capability of the intention model is thereby stronger, the user's intention can be analyzed more accurately, and the voice call with the user flows more smoothly.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are only some embodiments of the present application, and other drawings may be derived from those drawings by those skilled in the art without inventive effort.
FIG. 1 is a flow diagram illustrating a voice robot telephony method employing highly generalized multi-tasking intent recognition in accordance with an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a voice robot telephony method employing highly generalized multi-tasking intent recognition in accordance with another exemplary embodiment.
FIG. 3 is a schematic diagram illustrating a voice robot telephony method employing highly generalized multi-tasking intent recognition in accordance with another exemplary embodiment.
FIG. 4 is a schematic diagram illustrating a voice robot telephony method employing highly generalized multi-tasking intent recognition in accordance with another exemplary embodiment.
FIG. 5 is a schematic diagram illustrating a voice robot telephony method employing highly generalized multitask intent recognition according to another exemplary embodiment.
FIG. 6 is a block diagram illustrating a voice robot communicator employing highly generalized multi-tasking intent recognition in accordance with an exemplary embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below could be termed a second component without departing from the teachings of the present concepts. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present application and are, therefore, not intended to limit the scope of the present application.
The technical terms involved in the present application are explained as follows:
An intention recognition model: a model that predicts intent from text (audio or other information), mainly consisting of deep learning models. In actual production, however, intention recognition systems generally perform prediction by combining models with rules.
Multi-task learning: training a model on several tasks simultaneously, which can strengthen its performance on an individual task.
Self-supervised training: a training mode that uses only data, with no target labels; for example, only a large amount of raw text. In this case, target labels can be derived by modifying the original data: for the text "Today is Wednesday", the mask method reforms it into the input "Today is [MASK]" with the target label "Wednesday".
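This masking idea can be sketched in a few lines of Python (a hypothetical helper assuming whole-token masks; not code from the patent):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace a random fraction of tokens with [MASK].

    Returns the masked sequence and per-position labels: the original
    token where a mask was applied, None elsewhere.
    """
    rng = random.Random(seed)
    n = max(1, int(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels

masked, labels = mask_tokens(["Today", "is", "Wednesday"])
```

The (masked, labels) pair plays the role of the (data, label) example above: the model sees the masked sequence and must recover the hidden token.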
Pre-trained model: a model obtained by applying a self-supervised training method to a large amount of data without target labels. The resulting model captures the internal characteristics of individual data instances: after data is input into the model, abstract features representing the data are obtained, and these features have extremely strong generalization capability. Examples include BERT, trained on text data, and wav2vec 2.0, trained on audio data. A pre-trained model is generally used in one of two ways: first, inputting data into the pre-trained model to obtain abstract features, then feeding those features into other models for training; second, fine-tuning the pre-trained model directly.
FIG. 1 is a flow diagram illustrating a voice robot telephony method employing highly generalized multi-tasking intent recognition in accordance with an exemplary embodiment. The voice robot call method 10 using highly generalized multitask intention recognition includes at least steps S102 to S110.
As shown in fig. 1, in S102, the voice robot establishes a voice call with the user and acquires the user's real-time speech text data during the call. For example, a dialogue script may be determined from the user information; the voice robot conducts the voice call with the user based on the dialogue script; and the user's real-time voice data is converted into speech text data through speech recognition.
In the embodiment of the application, the user may be an individual or an enterprise user. The user information may include basic information authorized by the user, such as service account information, user terminal device identification information, and user location information; it may also include behavior information, for example the user's page operation data, service access duration, and service access frequency. The specific content of the user information can be determined according to the actual application scenario and is not limited herein. More specifically, the current user's information can be obtained through web-page event tracking, subject to user authorization. Remote information can be the user's data on other transaction platforms or in other business departments.
In one embodiment, the user-state information in the user information may indicate that the user is unregistered; the intelligent robot may then determine that the dialogue script should be the registration-guidance script and conduct the voice call according to the call flow configured for that script.
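As a sketch, this script choice can be a simple lookup from a user-state signal; the state names and script descriptions below are illustrative assumptions, not taken from the patent:

```python
# Hypothetical mapping from user state to an opening dialogue script.
SCRIPTS = {
    "unregistered": "registration-guidance script",
    "registered": "service-recommendation script",
}

def select_script(user_info, default="generic greeting script"):
    """Pick the opening dialogue script from the user information dict."""
    return SCRIPTS.get(user_info.get("state"), default)
```

For example, `select_script({"state": "unregistered"})` yields the registration-guidance script, matching the flow described above; an unknown state falls back to the default.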
In S104, the real-time speech text data is input into a multitask intention recognition model, and a replacement text, an intention category and a predicted text are generated.
In one embodiment, the multitask intention model of the application can perform multitask training and calculation, generating several calculation results in one pass; more specifically, the calculation results can be the replacement text, the intention category and the predicted text. The intention category may be a label identifying the user's intention, for example: "very interested", "preparing to apply", "not interested", "will consider", or "comparing products".
The replacement text and the predicted text comprise the character vectors of a number of characters; they may be displayed on the user side as text, or not displayed in the user interface and only stored in the background for use in the following steps.
In S106, the accuracy of the intention category produced by the multitask intention recognition model is determined according to the replacement text and the predicted text. The replacement text and the predicted text may be compared for similarity, and the accuracy of the intention category determined according to the similarity comparison result.
Similarity comparison can be performed one by one between the character vectors in the replacement text and the character vectors in the predicted text, generating a similarity comparison result that reflects how close the two texts are. Through repeated calculations, an association can be established between this similarity and the accuracy of the intention recognition model, so that accuracy is determined from similarity.
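One plausible reading of this character-by-character comparison is an average cosine similarity over aligned character vectors (a sketch of the idea, not the patent's exact metric):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def text_similarity(replacement_vecs, predicted_vecs):
    """Average per-character cosine similarity between two vector sequences."""
    sims = [cosine(u, v) for u, v in zip(replacement_vecs, predicted_vecs)]
    return sum(sims) / len(sims) if sims else 0.0
```

A result near 1.0 means the predicted characters closely match the replaced originals, which the method then maps to a high confidence in the intention category.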
In S108, when the accuracy is greater than a threshold, the dialogue script is updated based on the intention category, and the voice robot continues the voice call with the user according to the updated script.
When the accuracy is greater than the threshold, a new dialogue script can be generated according to the current user's intention, and the call continues according to it.
In S110, when the accuracy is less than or equal to the threshold, the voice robot continues the voice call with the user according to the original dialogue script.
When the accuracy is less than or equal to the threshold, the current user's intention is treated as unknown, and the call continues with the original dialogue script.
According to the voice robot call method adopting highly generalized multitask intention recognition, a voice call is established between the voice robot and the user, and the user's real-time speech text data is acquired during the call; the real-time speech text data is input into a multitask intention recognition model to generate a replacement text, an intention category and a predicted text; the accuracy of the multitask intention recognition model is determined from the replacement text and the predicted text; when the accuracy is greater than a threshold, the dialogue script is updated based on the intention category; and the voice robot continues the voice call with the user according to the updated dialogue script. In this way the multitask training method is embedded into the training flow of the intention model, preventing the model obtained in the pre-training stage from losing too much pre-training information while the intention model is trained; the generalization capability of the intention model is thereby stronger, the user's intention can be analyzed more accurately, and the voice call with the user flows more smoothly.
It should be clearly understood that this application describes how to make and use particular examples, but the principles of this application are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
In one embodiment, the method further comprises: acquiring historical speech text data; setting intention category labels for the historical speech text data; and training a multitask-improved BERT language model with the labeled historical speech text data to generate the multitask intention recognition model.
FIG. 2 is a flow diagram illustrating a voice robot call method employing highly generalized multitask intent recognition according to another exemplary embodiment. The process 20 shown in FIG. 2 is a detailed description of training the multitask-improved BERT language model with historical speech text data bearing intention category labels to generate the multitask intention recognition model.
As shown in fig. 2, in S202, a BERT language model is trained with the labeled historical speech text data to generate a pre-trained model. For example, word-vector conversion is performed on all characters in the historical speech text data to generate a word embedding tensor, a sentence segment tensor and a position encoding tensor for each character; a character vector for each character is generated based on these three tensors; and the BERT language model is trained with the character vectors bearing intention category labels to generate the pre-trained model.
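The three-tensor combination is the standard BERT input construction: the character vector is the elementwise sum of the word, segment and position embeddings. A minimal sketch with toy lookup tables (the table values are illustrative assumptions, not learned embeddings):

```python
# Toy embedding tables; a real model learns these during training.
WORD = {"hi": [1.0, 2.0], "there": [3.0, 4.0]}
SEGMENT = [[0.1, 0.1], [0.2, 0.2]]       # indexed by sentence-segment id
POSITION = [[0.01, 0.02], [0.03, 0.04]]  # indexed by position in the sequence

def char_vector(token, segment_id, position):
    """BERT-style input vector: elementwise sum of word, segment and position embeddings."""
    return [w + s + p for w, s, p in zip(WORD[token], SEGMENT[segment_id], POSITION[position])]

vec = char_vector("hi", 0, 0)
```

Each character thus carries its identity, which sentence segment it belongs to, and where it sits in the sequence, all in one vector.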
The specific process of training the BERT language model with the labeled historical speech text data to generate a pre-trained model may be as shown in fig. 3: in the text T1…T6, T2 and T5 are replaced by masks through the mask operation; the masked text is then passed to BERT, which tries to predict the masked-out T2 and T5; when the training loss function meets the preset strategy, the pre-trained model is generated.
In S204, a replacement operation is performed on part of the characters in the historical speech text data to generate input text data. For example, a preset proportion of the characters is extracted for replacement; the characters as they were before the replacement are stored; and the input text data is generated from the character vectors of the unreplaced and the replaced characters.
The mask operation described above is reused here to replace part of the characters and generate the input text data. Word-vector conversion can then be performed on the characters in the text data to generate a character vector for each character.
In S206, the input text data is input into the multitask-improved pre-trained model to obtain a predicted text and a predicted intention classification. For example, the input text data may be input into the pre-trained model; the pre-trained model performs intention prediction on the input text based on a bidirectional encoding mechanism; the pre-trained model identifies the replaced characters in the input text based on a multitask mechanism; and the pre-trained model generates the predicted text and the predicted intention classification from the calculation results.
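The shared-encoder, two-head structure implied here can be sketched as follows. The encoder and heads are trivial stand-ins chosen only for illustration; a real implementation would use a BERT encoder with a masked-LM head and an intent-classification head:

```python
INTENTS = ["very interested", "not interested", "will consider"]

def encode(chars):
    """Stand-in for the shared BERT encoder: one 'hidden state' per character."""
    return [float(ord(c)) for c in chars]

def lm_head(hidden, vocab):
    """Task 1: recover each position as the vocab character nearest its hidden state."""
    return [min(vocab, key=lambda ch: abs(ord(ch) - h)) for h in hidden]

def intent_head(hidden):
    """Task 2: bucket the mean hidden state into one of the intent labels."""
    mean = sum(hidden) / len(hidden)
    return INTENTS[int(mean) % len(INTENTS)]

def multitask_forward(chars, vocab):
    """One shared encoding, two outputs: predicted text and predicted intent."""
    hidden = encode(chars)
    return {"predicted_text": lm_head(hidden, vocab), "intent": intent_head(hidden)}
```

The design point is that both heads read the same hidden states, so training the text-recovery task constrains the encoder even while the intent task is learned.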
The application process of a pre-trained model in the prior art can be as shown in fig. 4: taking the BERT model obtained in the pre-training stage as a starting point, an intention model is trained using a [text, intention] dataset.
In one embodiment of the present application, the process of inputting the input text data into the pre-training model based on the multitask improvement to obtain the predicted text and the predicted intention classification may be as shown in fig. 5.
In the multitask-based intention model training method, as shown in fig. 5, the text T1…T6 undergoes the mask operation, T3 and T7 are replaced by masks, the masked text is then passed to BERT, BERT tries to predict the masked-out T3 and T7, and the user's intention classification is calculated at the same time.
Because the text input to the original intention model in the training stage has not undergone the mask operation, while the text input in the pre-training stage has, adding the mask operation in the present application enhances the generalization capability of the model and keeps the training stage consistent with the pre-training stage.
Multitask training is also added: when the multitask intention model is trained, the BERT model is required to recognize intentions accurately while still completing the same task as in the pre-training stage. In this way, the BERT model retains the generalization capability of the pre-training stage while the intention model is trained.
In S208, when the loss function corresponding to the predicted text and the predicted intent classification satisfies a preset policy, the multitask intent recognition model is generated. The similarity comparison between the predicted text and the replaced character can be performed, for example, to generate a first comparison result; comparing the predicted intention classification with the intention label to generate a second comparison result; and when the first comparison result is larger than a text threshold and the second comparison result is larger than an intention threshold, generating the multitask intention recognition model according to the current parameters of the pre-training model.
Specifically, a temporary multitask intention recognition model is constructed for each sample data. The object information of each object in the object set is input into the temporary multitask intention recognition model to obtain a prediction label, and each prediction label is compared with the corresponding real label to judge whether they are consistent. The number of prediction labels consistent with the real labels is counted, and the proportion of that number among all prediction labels is calculated. If the proportion is greater than or equal to a preset proportion value, the temporary multitask intention recognition model has converged, and the trained multitask intention recognition model is obtained; if the proportion is smaller than the preset proportion value, the parameters of the temporary multitask intention recognition model are adjusted, and the prediction label of each object is re-predicted with the adjusted model until the proportion is greater than or equal to the preset proportion value. The parameters of the temporary multitask intention recognition model can be adjusted with a stochastic gradient descent algorithm, a gradient descent algorithm, or the normal equation.
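The convergence test described above can be sketched as a loop that stops once the proportion of correct predictions reaches the preset value. The function and stand-in model below are illustrative assumptions; in the patent, `adjust` would be one step of (stochastic) gradient descent.

```python
# Sketch of the convergence check: count prediction labels that match the
# real labels, stop when the proportion reaches the preset value, otherwise
# keep adjusting parameters up to a maximum number of rounds.

def train_until_converged(samples, predict, adjust, preset_ratio=0.9, max_rounds=100):
    for _ in range(max_rounds):
        correct = sum(1 for x, label in samples if predict(x) == label)
        if correct / len(samples) >= preset_ratio:
            return True   # model converged
        adjust()          # e.g. one gradient-descent step
    return False          # exceeded the preset number of adjustments

# Toy run: a stand-in "model" that becomes correct after two adjustments.
state = {"rounds": 0}

def predict(x):
    return x if state["rounds"] >= 2 else None

def adjust():
    state["rounds"] += 1

converged = train_until_converged([("a", "a"), ("b", "b")], predict, adjust)
```

Returning `False` corresponds to the case in the next paragraph, where the adjustment count exceeds the preset number and the underlying model is replaced.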
If the number of times the parameters of the temporary multitask intention recognition model have been adjusted exceeds a preset number, the model used to construct the temporary multitask intention recognition model can be replaced, so as to improve training efficiency.
The most common intention recognition method in the industry at present, shown in fig. 4, uses a pre-trained text model: starting from a pre-trained model that has already been trained (whether downloaded or retrained), the [text, intention] data is used to fine-tune the model, resulting in an intention recognition model. However, this method has a major disadvantage: the change of the model parameters during fine-tuning destroys the original information in the pre-trained model. The reason is that the intention recognition task and the pre-training self-supervision task are two distinct tasks; although pre-training produces BERT parameters with strong generalization ability, the parameter changes during intention model training reduce that generalization ability. Accordingly, a method for enhancing the generalization ability of an intention model using multitask learning is presented herein.
In the present application, a multitask training method (whose added tasks are consistent with the tasks of the pre-training stage) is embedded into the training process of the intention model. This prevents the model obtained in the pre-training stage from losing too much pre-training information while the intention model is trained, so the generalization ability of the intention model is stronger.
Those skilled in the art will appreciate that all or part of the steps to implement the above embodiments are implemented as a computer program executed by a CPU. When executed by the CPU, the program performs the functions defined by the methods provided in the present application. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the present application, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
FIG. 6 is a block diagram illustrating a voice robot communication device employing highly generalized multitask intention recognition according to an example embodiment. As shown in fig. 6, the voice robot communication device 60 employing highly generalized multitask intention recognition includes: a text module 602, a recognition module 604, a judging module 606, an updating module 608, and a call module 610.
The text module 602 is used for establishing a voice call between the voice robot and a user, and acquiring real-time voice text data of the user during the call;
the recognition module 604 is configured to input the real-time speech text data into a multitask intention recognition model to generate a replacement text, an intention category and a predicted text;
the judging module 606 is used for determining the accuracy of the multitask intention recognition model according to the replacement text and the prediction text;
the updating module 608 is configured to update a communication script based on the intention category when the accuracy rate is greater than a threshold;
the call module 610 is configured to enable the voice robot to continue the voice call with the user according to the updated communication script.
According to the voice robot communication device adopting highly generalized multitask intention recognition, a voice call is established between the voice robot and the user, and real-time voice text data of the user is acquired during the call; the real-time voice text data is input into the multitask intention recognition model to generate a replacement text, an intention category, and a predicted text; the accuracy of the multitask intention recognition model is determined from the replacement text and the predicted text; the communication script is updated based on the intention category when the accuracy is greater than a threshold; and the voice robot continues the voice call with the user according to the updated communication script. By embedding the multitask training method into the training process of the intention model, the model obtained in the pre-training stage is prevented from losing too much pre-training information when the intention model is trained, so the generalization ability of the intention model is stronger, the intention of the user can be analyzed more accurately, and the voice call with the user is smoother.
FIG. 7 is a block diagram of an electronic device including a processor 710, a communication interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740, according to an example embodiment;
a memory 730 for storing a computer program;
the processor 710 is configured to implement the method for adjusting data distribution permissions based on video expression actions according to any of the embodiments described above when executing the program stored in the memory 730.
In the electronic device provided by the embodiment of the present invention, the processor 710 obtains the data allocation initial permission and the access information of the target by executing the program stored in the memory 730; determining video text content through the access information; establishing a real-time video link with the target, and displaying the video text content according to the video link to generate video data; recognizing expression actions of the user in the video data to determine a corresponding authority adjustment coefficient; and adjusting the data distribution authority of the user according to the initial authority and the authority adjustment coefficient.
The communication bus 740 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 740 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 720 is used for communication between the above-described electronic apparatus and other apparatuses.
The memory 730 may include a random access memory (RAM), or may include a non-volatile memory, such as at least one disk storage. Optionally, the memory 730 may also be at least one storage device located remotely from the processor 710.
The processor 710 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Embodiments of the present invention provide a computer-readable storage medium storing one or more programs, which are executable by one or more processors 710 to implement the voice robot call method using highly generalized multitask intention recognition according to any of the above embodiments. For example: the voice robot establishes a voice call with a user, and acquires real-time voice text data of the user during the call; the real-time voice text data is input into a multitask intention recognition model to generate a replacement text, an intention category, and a predicted text; the accuracy of the intention category in the multitask intention recognition model is determined according to the replacement text and the predicted text; a communication script is updated based on the intention category when the accuracy is greater than a threshold; and the voice robot continues the voice call with the user according to the updated communication script.
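The accuracy-gated decision in the steps above can be sketched as a small dispatch function. The script table and names below are hypothetical stand-ins, not from the patent:

```python
# Sketch of the call-flow decision: when accuracy exceeds the threshold, the
# communication script is updated from the intention category; otherwise the
# original script is kept (as in the embodiment of claim 10).

SCRIPTS = {
    "query_balance": "Your balance is ...",
    None: "How can I help you?",   # default / original script
}

def next_script(accuracy, intent, threshold=0.8, current=SCRIPTS[None]):
    if accuracy > threshold:
        return SCRIPTS.get(intent, current)  # update script from the intent
    return current                           # keep the original script

# next_script(0.9, "query_balance") updates the script;
# next_script(0.5, "query_balance") keeps the original one.
```

The threshold value 0.8 is an arbitrary placeholder; the patent leaves the threshold unspecified.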
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The available media may be magnetic media (e.g., floppy disks, hard disks, tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid State Disks (SSDs)), among others.
Exemplary embodiments of the present application are specifically illustrated and described above. It is to be understood that the application is not limited to the details of construction, arrangement or method of operation set forth herein; on the contrary, the intention is to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (13)

1. A voice robot conversation method adopting high-generalization multitask intention recognition is characterized by comprising the following steps:
the voice robot establishes voice communication with a user, and acquires real-time voice text data of the user in the communication process;
inputting the real-time voice text data into a multi-task intention recognition model to generate a replacement text, an intention category and a prediction text;
determining the accuracy of the intention category in the multitask intention recognition model according to the replacement text and the predicted text;
updating a communication script based on the intention category when the accuracy rate is greater than a threshold;
and the voice robot continues to carry out the voice call with the user according to the updated communication script.
2. The method of claim 1, further comprising:
acquiring historical voice text data;
setting an intention category label for the historical voice text data;
training a multitask-based improved BERT language model through historical speech text data with intention category labels to generate the multitask intention recognition model.
3. The method of claim 2, wherein training a multitask-based refined BERT language model with historical speech text data with intent category labels to generate the multitask intent recognition model comprises:
training the BERT language model through historical voice text data with intention category labels to generate a pre-training model;
replacing partial characters in the historical voice text data to generate input text data;
inputting the input text data into the pre-training model improved based on multiple tasks to obtain a predicted text and a predicted intention classification;
and when the loss functions corresponding to the predicted texts and the predicted intention classifications meet a preset strategy, generating the multitask intention recognition model.
4. The method of claim 3, wherein training the BERT language model with historical speech text data with intent class labels generates a pre-trained model comprising:
performing word vector conversion on all characters in the historical voice text data to generate a word embedding tensor, a statement block tensor and a position coding tensor of each character;
generating a character vector of each character based on the word embedding tensor, the sentence blocking tensor and the position coding tensor;
and training the BERT language model through the character word vectors with the intention category labels to generate the pre-training model.
5. The method of claim 3, wherein performing a replacement operation on a portion of characters in the historical phonetic text data to generate input text data comprises:
extracting characters with a preset proportion to perform replacement operation;
storing partial characters before replacement operation;
the input text data is generated by character vectors of the unsubstituted characters and the substituted characters.
6. The method of claim 3, wherein inputting the input text data into the pre-trained model after the multi-tasking based refinement, resulting in predicted text and predicted intent classifications, comprises:
inputting the input text data to the pre-training model;
the pre-training model performs intention prediction on the input text based on a bidirectional coding mechanism;
the pre-training model identifies the replaced characters in the input text based on a multitasking mechanism;
and the pre-training model generates a prediction text and a prediction intention classification according to the calculation result.
7. The method of claim 3, wherein generating the multitask intent recognition model when a loss function corresponding to the predicted text and the predicted intent classification satisfies a preset policy comprises:
comparing the similarity of the predicted text and the replaced characters to generate a first comparison result;
comparing the predicted intention classification with the intention label to generate a second comparison result;
and when the first comparison result is larger than a text threshold and the second comparison result is larger than an intention threshold, generating the multitask intention recognition model according to the current parameters of the pre-training model.
8. The method of claim 1, wherein the voice robot establishes a voice call with the user, and acquiring real-time voice text data of the user during the call comprises:
determining a communication script according to the user information;
the voice robot carries out a voice call with the user based on the communication script;
and converting the real-time voice data of the user into voice text data through voice recognition.
9. The method of claim 1, wherein determining an accuracy rate of the intent categories in the present multitask intent recognition model based on the alternative text and the predicted text comprises:
comparing the similarity of the replacement text and the predicted text;
and determining the accuracy of the intention category in the multitask intention recognition model according to the similarity comparison result.
10. The method of claim 1, further comprising:
and when the accuracy is less than or equal to the threshold, the voice robot continues the voice call with the user according to the original communication script.
11. A voice robot communicator employing highly generalized multitask intention recognition, comprising:
the text module is used for establishing voice communication between the voice robot and the user and acquiring real-time voice text data of the user in the communication process;
the recognition module is used for inputting the real-time voice text data into a multi-task intention recognition model to generate a replacement text, an intention category and a prediction text;
the judging module is used for determining the accuracy of the intention category in the multitask intention recognition model according to the replacement text and the prediction text;
the updating module is used for updating a communication script based on the intention category when the accuracy rate is greater than a threshold;
and the call module is used for the voice robot to continue the voice call with the user according to the updated communication script.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-10.
13. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 10.
CN202211099250.6A 2022-09-07 2022-09-07 Voice robot communication method and device adopting high-generalization multitask intention recognition Pending CN115662431A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211099250.6A CN115662431A (en) 2022-09-07 2022-09-07 Voice robot communication method and device adopting high-generalization multitask intention recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211099250.6A CN115662431A (en) 2022-09-07 2022-09-07 Voice robot communication method and device adopting high-generalization multitask intention recognition

Publications (1)

Publication Number Publication Date
CN115662431A true CN115662431A (en) 2023-01-31

Family

ID=84984066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211099250.6A Pending CN115662431A (en) 2022-09-07 2022-09-07 Voice robot communication method and device adopting high-generalization multitask intention recognition

Country Status (1)

Country Link
CN (1) CN115662431A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116208712A (en) * 2023-05-04 2023-06-02 北京智齿众服技术咨询有限公司 Intelligent outbound method, system, equipment and medium for improving user intention
CN116501915A (en) * 2023-06-29 2023-07-28 长江三峡集团实业发展(北京)有限公司 Energy management end voice page retrieval method and system
CN116501915B (en) * 2023-06-29 2023-10-20 长江三峡集团实业发展(北京)有限公司 Energy management end voice page retrieval method and system

Similar Documents

Publication Publication Date Title
CN115662431A (en) Voice robot communication method and device adopting high-generalization multitask intention recognition
CN110472224B (en) Quality of service detection method, apparatus, computer device and storage medium
US11361163B2 (en) System and method to represent conversational flows as graph embeddings and to conduct classification and clustering based on such embeddings
US11461420B2 (en) Referring expression generation
CN110309299B (en) Communication anti-fraud method, device, computer readable medium and electronic equipment
CN107071193A (en) The method and apparatus of interactive answering system accessing user
CN109087667B (en) Voice fluency recognition method and device, computer equipment and readable storage medium
CN110110038A (en) Traffic predicting method, device, server and storage medium
CN111061867B (en) Text generation method, equipment, storage medium and device based on quality perception
CN114912433A (en) Text level multi-label classification method and device, electronic equipment and storage medium
CN110399472B (en) Interview question prompting method and device, computer equipment and storage medium
CN114548118A (en) Service conversation detection method and system
CN110716767B (en) Model component calling and generating method, device and storage medium
CN113806501B (en) Training method of intention recognition model, intention recognition method and equipment
CN114387024A (en) User analysis system for E-commerce repurchase behavior based on Mamdani algorithm
CN113570260A (en) Task allocation method, computer-readable storage medium and electronic device
CN111159378A (en) Method and device for classifying problem description information
CN116204624A (en) Response method, response device, electronic equipment and storage medium
CN110263135A (en) A kind of data exchange matching process, device, medium and electronic equipment
CN115080745A (en) Multi-scene text classification method, device, equipment and medium based on artificial intelligence
CN114898184A (en) Model training method, data processing method and device and electronic equipment
CN112489630A (en) Voice recognition method and device
KR20210059196A (en) System and Method for Predicting Preference National Using Long Term Short Term Memory
CN115599891B (en) Method, device and equipment for determining abnormal dialogue data and readable storage medium
CN113434630B (en) Customer service evaluation method, customer service evaluation device, terminal equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240403

Address after: 536007 Beihai Mangrove Modern Financial Industry City, No. 288 Hubei Road, Yinhai District, Beihai City, Guangxi Zhuang Autonomous Region -2nd Floor, Building 12, Beihai International Financial Center, C06

Applicant after: Beihai Qiang Information Technology Co.,Ltd.

Country or region after: China

Address before: 536000 A59, first floor, No. 98 Jinke Road, Beihai City, Guangxi Zhuang Autonomous Region (Beihai mangrove town business secretary Co., Ltd.)

Applicant before: Beihai Qicheng Information Technology Co.,Ltd.

Country or region before: China
