CN112154465A - Method, device and equipment for learning intention recognition model - Google Patents

Info

Publication number
CN112154465A
CN112154465A
Authority
CN
China
Prior art keywords
skill
server
intention
data corresponding
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880093483.0A
Other languages
Chinese (zh)
Inventor
张晴
杨威
肖一凡
张良和
芮祥麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN112154465A
Legal status: Pending

Classifications

    • G06F 40/30 — Semantic analysis (handling natural language data)
    • G06F 18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates (recognition of textual entities)
    • G06F 40/35 — Discourse or dialogue representation
    • G06N 20/20 — Ensemble learning


Abstract

A method, apparatus, and device for learning an intention recognition model relate to the field of communications technologies and help improve the accuracy of the intention recognition model in a human-machine dialog system, the accuracy of the tasks the system executes, and the user experience. The method includes the following steps: the server receives positive data corresponding to a first skill input by a skill developer (S101); the server generates negative data corresponding to the first skill from the positive data corresponding to the first skill (S102); the server determines a second skill similar to the first skill (S103); the server acquires data corresponding to each second skill (S104); the server generates a second base model from the data corresponding to the second skill and a first base model stored on the server (S105); and the server learns from the positive data and negative data corresponding to the first skill and the second base model to generate an intention recognition model (S106).

Description

Method, device and equipment for learning an intention recognition model

Technical Field
The present application relates to the field of communications technologies, and in particular, to a method, an apparatus, and a device for learning an intention recognition model.
Background
Human-machine dialog systems, also known as human-machine interaction platforms or chat robots (chatbots), are a new generation of human-machine interfaces. Depending on the domain involved, such systems are divided into open-domain chatbots and task-oriented chatbots.
A task-oriented chatbot can provide services such as meal ordering, ticket booking, and ride-hailing to an end user. For example, the providers of these services input in advance some training data corresponding to function A into the server (for example, user utterances, which may also be called a corpus), and the server trains a model corresponding to function A by ensemble learning on the input training data. The model corresponding to function A can then be used to predict a new user utterance entered by an end user and determine the user's intent, i.e., whether the server should provide the end user with the service corresponding to function A.
During ensemble learning, the server trains a number of preset base learners (also called base models) on the training data input by the service provider, and integrates the trained base learners according to a certain rule, thereby obtaining a model more accurate than any single base learner.
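As a hedged illustration only (not the patent's actual implementation), the integration rule can be as simple as majority voting over the base learners; the toy keyword-based learners below are invented for the example:

```python
# Minimal sketch of ensemble integration by majority voting. Each base
# learner maps an utterance to a predicted intent label; all learners and
# labels here are hypothetical illustrations.
from collections import Counter

def ensemble_predict(base_learners, utterance):
    """Combine base-learner votes into a single intent prediction."""
    votes = [learner(utterance) for learner in base_learners]
    # The most common vote wins; ties resolve by first occurrence.
    return Counter(votes).most_common(1)[0][0]

# Three toy base learners that "recognize" an intent from a keyword.
learner_a = lambda u: "order_food" if "order" in u else "other"
learner_b = lambda u: "order_food" if "food" in u else "other"
learner_c = lambda u: "book_ticket" if "ticket" in u else "other"

print(ensemble_predict([learner_a, learner_b, learner_c], "order some food"))
# → order_food
```

Even when one learner disagrees, the ensemble outvotes it, which is why the combined model can be more accurate than any single base learner.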
Therefore, the richness and accuracy of the training data input by the service provider, and the reasonableness of the preset base models, directly affect the accuracy of the model obtained by ensemble learning. If the input training data is scarce or inaccurate, or some of the preset base models are unsuitable, the accuracy of the ensemble-learned model is seriously degraded, which in turn affects the accuracy with which the chatbot executes a specific task and harms the end user's experience.
Disclosure of Invention
The method, apparatus, and device for learning an intention recognition model provided by this application can improve the accuracy of the intention recognition model in a human-machine dialog system, improve the accuracy with which the system executes tasks, and improve user experience.
In a first aspect, an embodiment of this application provides a learning method for an intention recognition model. The method includes: the server receives positive data corresponding to each intention in a first skill input by a skill developer; the server generates negative data corresponding to each intention in the first skill from the positive data corresponding to each intention; the server acquires training data corresponding to at least one second skill, where each second skill is similar to the first skill; the server learns from the training data corresponding to the second skill and a preset first base model to generate at least one second base model; and the server learns from the positive data corresponding to each intention in the first skill, the negative data corresponding to each intention in the first skill, and the second base model to generate an intention recognition model.
Thus, in this embodiment, negative data is introduced when training the intention recognition model, which reduces the misrecognition caused by training on positive data alone and improves the accuracy of the learned model. In addition, training data from other similar voice skills is introduced during training, so that more base models are learned and their variety is enriched, which further improves the accuracy of the trained intention recognition model.
In one possible implementation, the server generating the negative data corresponding to each intention in the first skill from the positive data includes: for each intention in the first skill, the server extracts the keywords corresponding to that intention, where the keywords are the key features affecting the weights of the first base model; and the server combines keywords corresponding to different intentions in the first skill, or combines keywords corresponding to different intentions with words unrelated to the first skill, and determines the combined words as negative data for those intentions.
The keywords above are the salient features that most strongly affect the feature weights during classification (intention recognition). In the prior art, misrecognition occurs because the salient features of one class (intention) overlap heavily with the salient features of other classes (other intentions). For example, intent 1 is to open an app and intent 2 is to close the app. The word "app" appears with high frequency in both intents 1 and 2, so it is salient overall but not a salient feature of either class; if such a high-frequency feature is still relied on during classification, misrecognition easily results. The embodiments of this application therefore improve accuracy by reducing the classification weight of salient features shared across these classes (intentions).
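As an illustrative sketch of this idea (the keyword lists here are invented, not extracted from a real base model), negative candidates for one intention can be formed by cross-combining its shared feature with the salient keyword of a competing intention:

```python
# Hedged sketch: build candidate negative samples for an intention by pairing
# keywords from other intentions (or from outside the skill) with shared,
# non-discriminative features. A real system would extract keywords by
# feature-weight analysis on the first base model.
from itertools import product

def generate_negatives(intent_keywords, other_keywords):
    """Cross-combine keywords to build negative utterances for an intention."""
    return [f"{a} {b}" for a, b in product(intent_keywords, other_keywords)]

open_kw = ["open"]      # salient keyword of intent 1, "open the app"
close_kw = ["close"]    # salient keyword of intent 2, "close the app"
shared = ["app"]        # high-frequency but non-discriminative feature

# "close app" serves as a negative sample when training intent 1:
print(generate_negatives(close_kw, shared))  # → ['close app']
```

Training intent 1 against such negatives pushes down the weight of the shared feature "app", so classification relies on the truly discriminative words.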
In one possible implementation, the server generating the negative data corresponding to each intention in the first skill from the positive data further includes: the server determines a first set from the full set of training data stored on the server according to the classification of the first skill, where the training data in the first set includes negative data corresponding to each intention in the first skill (the training data includes both positive and negative data); the server samples a preset amount of training data from the first set; and the server determines, by manual labeling and/or a clustering algorithm, the negative data corresponding to each intention in the first skill from the sampled training data.
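A minimal sketch of the filtering-and-sampling step, assuming a toy corpus whose classification tags and utterances are entirely hypothetical:

```python
# Restrict the full training corpus to utterances outside the first skill's
# classification (the "first set"), then draw a fixed-size sample that a
# later manual-labeling or clustering pass would confirm as negative data.
import random

def sample_candidates(corpus, skill_class, n, seed=0):
    """Keep utterances outside the skill's classification, then sample n."""
    first_set = [u for u, cls in corpus if cls != skill_class]
    random.seed(seed)  # fixed seed for a reproducible sketch
    return random.sample(first_set, min(n, len(first_set)))

corpus = [
    ("order a pizza", "food"),
    ("play some music", "media"),
    ("book a flight", "travel"),
    ("pause the song", "media"),
]
# Utterances outside the "food" skill are candidate negatives for it.
print(len(sample_candidates(corpus, "food", n=2)))  # → 2
```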
In one possible implementation, the server acquiring the data corresponding to the second skill includes: the server determines the second skill according to the classification of the first skill and/or the positive data of the first skill, and acquires the training data corresponding to the second skill through a sharing layer.
Because training data from other similar voice skills is introduced when training the intention recognition model, this alleviates the problem that the developer of the first skill inputs too little training data for an accurate intention recognition model to be learned.
In a possible implementation, the server learning from the training data corresponding to the second skill and a preset first base model to generate the second base model includes: the server learns by a multi-task learning method from the training data corresponding to the second skill and the preset first base model to generate the second base model.
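To illustrate the multi-task structure only (the toy "network" below, its weights, and its inputs are hypothetical, not the patent's model), a representation shared across skills can feed several skill-specific heads:

```python
# Sketch of the multi-task idea: one shared layer produces a common
# representation from second-skill data, while each skill keeps its own
# output head. All weights here are illustrative constants.
def shared_layer(features, shared_weights):
    # Representation shared by all tasks: element-wise weighting.
    return [f * w for f, w in zip(features, shared_weights)]

def task_head(hidden, head_weights, bias):
    # Per-skill head: a linear score over the shared representation.
    return sum(h * w for h, w in zip(hidden, head_weights)) + bias

x = [1.0, 2.0]
hidden = shared_layer(x, shared_weights=[0.5, 0.25])     # shared across skills
score_skill_1 = task_head(hidden, [1.0, 1.0], bias=0.0)  # head for skill 1
score_skill_2 = task_head(hidden, [2.0, 0.0], bias=0.1)  # head for skill 2
print(score_skill_1, score_skill_2)  # → 1.0 1.1
```

In real multi-task learning the shared weights are trained jointly on all tasks, so data-poor skills benefit from data-rich similar skills.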
In one possible implementation, the server learning from the positive data corresponding to each intention in the first skill, the negative data corresponding to each intention in the first skill, and the second base model to generate the intention recognition model includes: the server learns by an ensemble learning method from the positive data, the negative data, and the second base model to generate the intention recognition model.
In one possible implementation, the server generating the intention recognition model specifically includes: the server generates a plurality of intermediate models in the process of generating the intention recognition model; the server determines the first intermediate model generated as the fastest intention recognition model; and/or the server determines the best intention recognition model after performing model selection and parameter tuning on the plurality of intermediate models.
The embodiments of this application provide multiple return mechanisms, such as returning the fastest-generated intention recognition model or returning the best intention recognition model. For example, when returning the fastest-generated model: since the server 300 obtains a plurality of intermediate models while learning the intention recognition model, and these intermediate models can also perform its function, the first intermediate model obtained can be returned to the voice-skill developer or user, who can thereby inspect the model's functions, performance, and so on as early as possible. As another example, when returning the best model: the server 300 can tune the parameters of the intermediate models obtained during learning and determine the most accurate of them as the intention recognition model. These mechanisms can satisfy various user requirements.
In one possible implementation, the server determining the best intention recognition model includes: the server computes a first accuracy of each intermediate model on the positive data corresponding to each intention in the first skill; the server computes a second accuracy of each intermediate model on the negative data corresponding to each intention in the first skill; and the server performs model selection and parameter tuning on the plurality of intermediate models according to the first accuracy, the second accuracy, and a weight input by the skill developer, and then determines the best intention recognition model.
In this embodiment, the accuracy of a model is tested not only on corpora within the voice skill (for example, the training data input by the skill developer) but also on corpora determined by the server beyond the voice skill, such as automatically generated negative data. Introducing more test data makes the test result more accurate, and tuning the model's parameters against this more accurate result yields the best intention recognition model. That is, the technical solution provided by this embodiment helps improve the accuracy of the intention recognition model.
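The selection step can be sketched as a weighted combination of the two accuracies; the weight w, the model names, and all accuracy values below are hypothetical:

```python
# Hedged sketch of model selection: score each intermediate model by a
# weighted combination of its accuracy on positive data (first accuracy)
# and its accuracy on negative data (second accuracy), with a
# developer-supplied weight w.
def combined_score(acc_pos, acc_neg, w):
    """w weights the positive-data accuracy; (1 - w) weights the negative-data accuracy."""
    return w * acc_pos + (1 - w) * acc_neg

# (first accuracy on positive data, second accuracy on negative data)
intermediates = {"model_a": (0.92, 0.70), "model_b": (0.85, 0.90)}
w = 0.5  # hypothetical developer-supplied weight
best = max(intermediates, key=lambda m: combined_score(*intermediates[m], w))
print(best)  # → model_b
```

Note how model_a wins on positive data alone, but once negative-data accuracy is weighted in, model_b is selected; this is why testing beyond the developer's own corpus changes which model counts as "best".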
In a second aspect, an embodiment of this application provides a server, including a processor, a memory, and a communication interface. The memory is configured to store computer program code, the computer program code including computer instructions that, when read from the memory by the processor, cause the server to:
receive, through the communication interface, positive data corresponding to each intention in a first skill input by a skill developer; generate negative data corresponding to each intention in the first skill from the positive data; acquire training data corresponding to at least one second skill, where each second skill is similar to the first skill; learn from the training data corresponding to the second skill and a preset first base model to generate at least one second base model; and learn from the positive data, the negative data, and the second base model to generate an intention recognition model.
In a possible implementation, the server generating the negative data corresponding to each intention in the first skill from the positive data specifically includes: for each intention in the first skill, the server extracts the keywords corresponding to that intention, where the keywords are the key features affecting the weights of the first base model; and the server combines keywords corresponding to different intentions in the first skill, or combines keywords corresponding to different intentions with words unrelated to the first skill, and determines the combined words as negative data for those intentions.
In a possible implementation, the server generating the negative data corresponding to each intention in the first skill from the positive data further specifically includes: the server determines a first set from the full set of training data stored on the server according to the classification of the first skill, where the training data in the first set includes negative data corresponding to each intention in the first skill (the training data includes both positive and negative data); samples a preset amount of training data from the first set; and determines, by manual labeling and/or a clustering algorithm, the negative data corresponding to each intention in the first skill from the sampled training data.
In a possible implementation, the server acquiring the data corresponding to the second skill includes: the server determines the second skill according to the classification of the first skill and/or the positive data of the first skill, and acquires the training data corresponding to the second skill through a sharing layer.
In a possible implementation, the server learning from the training data corresponding to the second skill and the preset first base model to generate the second base model includes: the server learns by a multi-task learning method from the training data corresponding to the second skill and the preset first base model to generate the second base model.
In a possible implementation, the server learning from the positive data corresponding to each intention in the first skill, the negative data corresponding to each intention in the first skill, and the second base model to generate the intention recognition model includes: the server learns by an ensemble learning method from the positive data, the negative data, and the second base model to generate the intention recognition model.
In a possible implementation, the server generating the intention recognition model further specifically includes: the server generates a plurality of intermediate models in the process of generating the intention recognition model; the server determines the first intermediate model generated as the fastest intention recognition model; and/or the server determines the best intention recognition model after performing model selection and parameter tuning on the plurality of intermediate models.
In a possible implementation, the server determining the best intention recognition model includes: the server computes a first accuracy of each intermediate model on the positive data corresponding to each intention in the first skill; computes a second accuracy of each intermediate model on the negative data corresponding to each intention in the first skill; and performs model selection and parameter tuning on the plurality of intermediate models according to the first accuracy, the second accuracy, and a weight input by the skill developer, and then determines the best intention recognition model.
In a third aspect, an embodiment of this application provides a computer storage medium comprising computer instructions which, when run on a server, cause the server to perform the method described in the first aspect and any of its possible implementations.
In a fourth aspect, an embodiment of this application provides a computer program product which, when run on a computer, causes the computer to perform the method described in the first aspect and any of its possible implementations.
Drawings
Fig. 1 is a first schematic structural diagram of a human-machine dialog system according to an embodiment of this application;
Fig. 2 is a schematic diagram of a prior-art learning method for an intention recognition model;
Fig. 3 is a schematic diagram of an intention recognition model usage scenario according to an embodiment of this application;
Fig. 4 is a schematic structural diagram of a human-machine dialog system according to an embodiment of this application;
Fig. 5 is a first schematic diagram of a learning method of an intention recognition model according to an embodiment of this application;
Fig. 6 is a second schematic diagram of a learning method of an intention recognition model according to an embodiment of this application;
Fig. 7 is a third schematic diagram of a learning method of an intention recognition model according to an embodiment of this application;
Fig. 8 is a schematic structural diagram of a server according to an embodiment of this application;
Fig. 9 is a flowchart of a learning method of an intention recognition model according to an embodiment of this application;
Fig. 10 is a fourth schematic diagram of a learning method of an intention recognition model according to an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings. In the description of these embodiments, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" herein merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone.
In the following, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of this application, "a plurality of" means two or more unless otherwise specified.
Fig. 1 is a schematic diagram of a human-machine dialog system according to an embodiment of this application. The system includes one or more electronic devices 100, one or more servers 200, and one or more servers 300. The electronic device 100 establishes a communication connection with the server 300, and the server 300 establishes communication connections with the electronic device 100 and the server 200, respectively. Optionally, the electronic device 100 may also establish a communication connection with the server 200. The communication connection may be established over a telecommunications network (a communication network such as 3G/4G/5G) or a Wi-Fi network, which is not limited in the embodiments of this application.
The electronic device 100 may be a mobile phone, a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a smart watch, a netbook, a wearable electronic device, an augmented reality (AR) device, a virtual reality (VR) device, a vehicle-mounted device, a smart car, a smart speaker, a robot, or the like; this application does not particularly limit the specific form of the electronic device.
The server 200 may be a server of a third-party application, used to provide the services of that application. The third-party application may be, for example, a food-delivery, e-commerce, or ride-hailing application (such as Amazon or Didi).
The server 300 may be a server of a manufacturer of the electronic device 100, for example, a cloud server of a voice assistant in the electronic device 100, and the server 300 may also be another server, which is not limited in this embodiment.
Taking as an example a third-party application that develops a new voice application or function (which may be referred to as a voice skill), the human-machine dialog scenario to which the technical solution of the embodiments of this application applies is explained below.
A voice skill is a function whereby a user, through conversational interaction between the electronic device 100 and the server 200 of a third-party application, can request one or more services provided in that application. The interaction process simulates scenes from the user's real life, so that interaction between the user and the electronic device feels as natural as interaction between people.
In this conversational interaction, each sentence spoken by the user corresponds to an intention, which is the purpose for which the user speaks that sentence. It should be noted that each voice skill is composed of a plurality of intentions; the server 300 determines the user's needs by matching each sentence the user speaks against the intentions in the voice skill, and provides corresponding services such as meal ordering, ticket booking, and ride-hailing.
For this process of matching the user's words against the intentions in the voice skill, that is, the process of intention recognition, the server 300 may construct an intention recognition model, which automatically recognizes the intention corresponding to a user utterance input by the user.
Fig. 2 shows a schematic diagram of the process by which the server 300 trains an intention recognition model. Specifically, a developer of a voice skill, such as the operator of the server 200, needs to input into the server 300 some training data corresponding to the new voice skill (which may include correspondences between user utterances and user intentions). From this training data, the server 300 can obtain an intention recognition model corresponding to the new voice skill using a suitable learning method, such as ensemble learning. For example, corpus A corresponding to voice skill 1 is input into model learning framework 1 for learning, yielding the intention recognition model corresponding to voice skill 1. Model learning framework 1 may be, for example, an ensemble learning framework, in which the smallest learning unit is a base model; the framework contains a plurality of base models, denoted base models 1. These base models 1 differ from one another. For example, different base models 1 may belong to different model families, such as support vector machines (SVMs) and logistic regression (LR); or belong to the same family but use different hyper-parameters (for example, SVMs with different values of the hyper-parameter C); or belong to the same model but use different data (for example, the same SVM trained on different subsets of the raw data).
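As an illustration of the three sources of base-model diversity named above (model family, hyper-parameter, and data subset), using placeholder configuration names rather than a real training API:

```python
# Each base-model configuration differs from every other in at least one of:
# model family, hyper-parameter value, or training-data subset. The names
# "SVM", "LR", and the subset labels are illustrative placeholders.
base_models = [
    {"family": "SVM", "C": 0.1,  "subset": "all"},
    {"family": "SVM", "C": 10.0, "subset": "all"},     # same family, new C
    {"family": "LR",  "C": 1.0,  "subset": "all"},     # different family
    {"family": "SVM", "C": 1.0,  "subset": "half_1"},  # same model, new data
]

def distinct(a, b):
    # Two base models must differ in family, hyper-parameter, or data subset.
    return any(a[k] != b[k] for k in ("family", "C", "subset"))

# Verify that every pair of configurations is distinct in some dimension.
assert all(distinct(a, b) for i, a in enumerate(base_models)
           for b in base_models[i + 1:])
print(len(base_models))  # → 4
```

Diversity across these three dimensions is what gives the ensemble complementary errors to average out.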
In addition, for the specific technical solution of the ensemble learning method involved in the embodiments of this application, refer to Chinese patent application CN102521599A, titled "A pattern training and recognition method based on ensemble learning", or to Chinese patent application CN107491531A, titled "Chinese online comment sentiment classification method based on an ensemble learning framework", as well as other prior-art implementations of ensemble learning; the embodiments of this application are not limited in this respect.
After training of the intention recognition model corresponding to the new voice skill is completed, when the electronic device 100 receives an utterance input by the user, it sends the utterance to the server 300. The server 300 runs the trained intention recognition model on the user's utterance to determine the corresponding user intention, and further determines the third-party application, and the specific function within that application, that should serve the user.
For example, fig. 3 shows a schematic diagram of the process by which the server 300 determines which third-party application should serve the user. Assume that, in addition to the newly learned intention recognition model corresponding to voice skill 1 (one of the voice skills of third-party application 1), the server 300 also stores an intention recognition model corresponding to voice skill 2 of third-party application 1, an intention recognition model corresponding to voice skill 3 of third-party application 2, and so on.
Specifically, the user inputs speech 1 through the electronic device 100. The electronic device 100 may convert speech 1 into the corresponding text, namely user utterance 1, and send the text to the server 300; alternatively, the electronic device 100 may send speech 1 directly to the server 300, which converts speech 1 into text (user utterance 1). The server 300 distributes user utterance 1 to the intention recognition models corresponding to the various voice skills; these models compute and determine that user utterance 1 corresponds to voice skill 1, that is, that the service corresponding to voice skill 1 of the third-party application should be provided to the user.
In some embodiments, after determining the third-party application that will serve the user, the server 300 may return the determined third-party application's information to the electronic device 100. This information may include the address of the third-party application's server, and may further include the specific service to be provided to the user, for example, placing an order or querying orders. The electronic device 100 may then establish a communication connection with the server 200 of the third-party application, and the server 200 provides the corresponding service to the user. In other embodiments, the server 300 may establish a communication connection with the server 200 of the corresponding third-party application and send the user's service request information to it; the server 200 of the third-party application may then interact with the electronic device 100 through the server 300 to provide the corresponding service. The service request information may include the specific service requested by the user, for example, placing an order or querying orders. The embodiments of this application do not limit the specific manner in which the server 200 of the third-party application serves the user of the electronic device 100.
It should be noted that fig. 4 is a schematic diagram of another man-machine conversation system provided in the embodiment of the present application. The man-machine conversation system includes: one or more electronic devices 100 and one or more servers 200. For the electronic device 100 and the server 200, refer to the related description in fig. 1.
The difference is that the above-described process of learning the intention recognition model corresponding to a new voice skill based on the training data input by the voice skill developer may be executed on the server 200, or may be executed on the electronic device 100 when the computing capability of the electronic device 100 can support it; the method provided in the embodiment of the present application does not limit the execution subject that trains the intention recognition model. If the training of the intention recognition model is performed by the server 200 of the third-party application, the server 200 trains intention recognition models for the third-party application's own plurality of voice skills. The following description takes the case where the server 300 trains the intention recognition model as an example.
It should be noted that, in the process in which the server 300 trains the intention recognition model, when the voice skill developer cannot provide enough training data, or cannot guarantee the accuracy of the training data, the accuracy of the target model trained by the server 300 cannot be guaranteed. This in turn affects the accuracy with which the server 300 recognizes the user's intention, and degrades the user experience of the electronic device. Because the scenes and types of voice skills are various, the base models preset by the server 300 inevitably have some unreasonable aspects, for example: not being applicable to certain scenarios or types of voice skills, which also affects the accuracy of the target model trained by the server 300.
It should be noted that the server 300 may provide a common man-machine conversation platform for a plurality of third-party applications, and a developer or maintainer of the platform may preset some basic skill templates (including some preset training data, the base models used in learning, and the like) on the platform, where the skill templates cover some common usage scenarios. A third-party application can then modify these basic skill templates to realize its own personalized requirements, or add customized skills according to its own service content.
Therefore, the technical scheme provided by the embodiment of the application can be applied to the process of learning the intention recognition model corresponding to the new voice skill by the server 300, and the accuracy of the intention recognition model learned by the server 300 can be improved on the basis of the prior art.
Fig. 5 is a schematic diagram illustrating a method for learning an intention recognition model according to an embodiment of the present application. The server 300 may automatically generate additional training data based on the input training data corresponding to the new voice skill. Then, both the training data input by the voice skill developer and the automatically generated training data are used by the server 300 for learning. Because more training data is introduced in the learning process, the accuracy of the intention recognition model obtained by the server 300 can be improved.
For example: the input training data of voice skill 1 comprises a corpus A. The server generates negative data, namely a corpus B, from the positive data in the corpus A, and then the corpus A and the corpus B are input into a model learning framework together for learning, so as to obtain the intention recognition model corresponding to voice skill 1. Positive data is user-spoken data that can trigger the server 300 to perform a corresponding operation. Negative data is data that is similar to the positive data but is not positive data, that is, user-spoken data for which the server 300 does not perform the corresponding operation.
In other embodiments of the present application, the server 300 may learn the intention recognition models corresponding to different voice skills of the same third-party application, and may also learn the intention recognition models corresponding to different voice skills of different third-party applications. That is, the server 300 may store intent recognition models for different voice skills of the same third-party application, as well as intent recognition models corresponding to different voice skills of different third-party applications. Considering that the intention recognition models corresponding to various subject and various types of voice skills can be stored on the server 300, and there may be some similarities among a large number of voice skills, the corpora corresponding to the voice skills can also be multiplexed. Therefore, the technical solution provided in the embodiment of the present application may also be used to perform training in combination with training data of other voice skills similar to a certain voice skill when training an intention recognition model corresponding to the voice skill. For example: the server 300 may augment the base model in the model learning framework with training data for other voice skills similar to the voice skill. The other voice skills similar to the voice skill may belong to the same third-party application as the voice skill, or may belong to a different third-party application as the voice skill, and the type and the number of the similar voice skills are not limited in the embodiment of the present application.
For example: fig. 6 is a schematic diagram of another process of learning an intention recognition model provided in the embodiment of the present application. The server 300 determines that voice skills 2 and 3 are other voice skills similar to voice skill 1.
The server 300 may input the corpus C corresponding to voice skill 2 into the model learning framework 2 to learn one or more base models 2, and may likewise input the corpus D corresponding to voice skill 3 into the model learning framework 2 to learn one or more base models 2. The model learning framework 2 includes one or more base models 1 pre-stored in the server 300. That is, according to the corpus C and the corpus D, the model learning framework 2 outputs a plurality of base models 2, i.e., the base models expanded by the server 300, which may be used as the base models in the model learning framework 3. For example, the model learning framework 2 may adopt a multi-task learning framework, and the model learning framework 3 may adopt an ensemble learning framework.
The server 300 may input the corpus a corresponding to the voice skill 1 input by the skill developer and the automatically generated corpus B into the model learning framework 3 for learning, so as to obtain the intention recognition model corresponding to the voice skill 1. Optionally, the server 300 may also directly input the corpus a corresponding to the voice skill 1 input by the skill developer into the model learning frame 3 for learning, so as to obtain the intention recognition model corresponding to the voice skill 1. The embodiment of the present application does not limit this.
Therefore, when the intention recognition model is trained, training data of other similar voice skills are introduced, more basic models are learned, the types of the basic models are enriched, and the accuracy of the trained intention recognition model can be improved.
In further embodiments of the present application, fig. 7 is a schematic diagram of still another process in which the server 300 learns an intention recognition model, as provided in the embodiments of the present application. Specifically, in the process of learning the intention recognition model, the server 300 sequentially obtains a plurality of intermediate models, and performs model selection and parameter adjustment on the intermediate models to obtain the finally generated intention recognition model with the highest accuracy. It should be noted that these intermediate models can all implement the function of the intention recognition model corresponding to voice skill 1. In view of the different requirements of voice skill developers, the embodiments of the present application provide a variety of return mechanisms, such as: returning the fastest-generated intention recognition model, returning the optimal intention recognition model, and the like. For example: when the fastest-generated intention recognition model is to be returned, since the server 300 obtains a plurality of intermediate models during learning, and these intermediate models can also implement the functions of the intention recognition model, the intermediate model obtained first can be determined as the intention recognition model returned to the voice skill developer or user, so that the developer or user can obtain the model at the fastest speed and learn about its functions, performance, and the like. Another example: when the optimal intention recognition model is to be returned, the server 300 may adjust the parameters of the obtained intermediate models during learning, so as to determine the model with the highest accuracy as the intention recognition model.
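The two return mechanisms can be sketched as follows (a minimal illustration; the record fields, model names, and timing values are assumptions, not part of the patent):

```python
# Sketch of the two return mechanisms: each intermediate model is recorded
# with the time at which it was produced and its measured accuracy.
# All field names and values below are illustrative assumptions.

def fastest_model(intermediates):
    """Return the intermediate model that was produced first."""
    return min(intermediates, key=lambda m: m["produced_at"])

def best_model(intermediates):
    """Return the intermediate model with the highest measured accuracy."""
    return max(intermediates, key=lambda m: m["accuracy"])

intermediates = [
    {"name": "intermediate-1", "produced_at": 10, "accuracy": 0.81},
    {"name": "intermediate-2", "produced_at": 25, "accuracy": 0.88},
    {"name": "intermediate-3", "produced_at": 40, "accuracy": 0.86},
]

print(fastest_model(intermediates)["name"])  # intermediate-1
print(best_model(intermediates)["name"])     # intermediate-2
```

The first mechanism trades some accuracy for turnaround time; the second waits for parameter tuning across all intermediate models before returning.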
With the technical solution provided by the embodiment of the present application, when an intention recognition model corresponding to a certain voice skill is trained, training data of other voice skills similar to that voice skill is combined. The embodiment of the present application further provides a standard for determining the optimal model: not only the corpora within the voice skill are considered (for example, the training data input by the skill developer, used to test the accuracy of the model), but also corpora beyond the voice skill determined by the server, such as the automatically generated negative data, are used to test the accuracy of the model. Because more test data is introduced when testing the accuracy of the model, the test result is more accurate, and the optimal intention recognition model can then be obtained after the parameters in the model are adjusted according to this more accurate test result. That is to say, the technical solution provided in the embodiment of the present application is favorable to improving the accuracy of the intention recognition model. The specific implementation process will be described in detail below.
As shown in fig. 8, which is a schematic diagram of a hardware structure of a server 300 according to an embodiment of the present disclosure, the server 300 includes at least one processor 301, at least one memory 302, and at least one communication interface 303. Optionally, the server 300 may further include an output device and an input device, not shown in the figure.
The processor 301, the memory 302, and the communication interface 303 are connected by a bus. The processor 301 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the solution of the present application. The processor 301 may also include multiple CPUs, and the processor 301 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, or processing cores that process data (e.g., computer program instructions).
In this embodiment, the processor 301 may be specifically configured to automatically generate negative data for a new voice skill according to the training data of the new voice skill input by the voice skill developer. The processor 301 may be further specifically configured to determine other voice skills similar to the new voice skill, obtain data of the other voice skills, and expand the base models in the model learning framework. The processor 301 may also learn the intention recognition model according to the expanded base models, the training data input by the developer, and the negative data automatically generated by the processor 301. The processor 301 may be further specifically configured to select among and tune the plurality of intermediate models generated during the learning process in order to generate an optimal intention recognition model, and the like.
The Memory 302 may be a Read-Only Memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 302 may be a separate device connected to the processor 301 through a bus, or may be integrated with the processor 301. The memory 302 is used for storing the application program code for executing the scheme of the application, and the processor 301 controls the execution. The processor 301 is configured to execute the computer program code stored in the memory 302, thereby implementing the method for learning an intention recognition model described in the embodiments of the present application.
In this embodiment of the application, the memory 302 may be configured to store the data of the base models preset in the learning framework in the server 300, and may also be configured to store the negative data corresponding to each voice skill automatically generated by the server 300, the intermediate models generated by the processor 301 in the learning process, various data for training or testing, and the learned intention recognition model corresponding to each voice skill.
Communication interface 303 may be used to communicate with other devices or communication networks, such as ethernet, Wireless Local Area Networks (WLAN), etc.
In this embodiment, the communication interface 303 may be specifically configured to communicate with the electronic device 100, so as to enable interaction with a user of the electronic device. The communication interface 303 may also be specifically used for communicating with the server 200 of a third party application, such as: the server 300 may receive training data corresponding to the new voice skill input by the third-party application server 200, or the server 300 may send the determined user service request to the third-party application server 200, so that the third-party application server 200 provides a corresponding service for the user of the electronic device, and the like.
An output device is in communication with the processor and may display information in a variety of ways. For example, the output device may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) Display device, a Cathode Ray Tube (CRT) Display device, a projector (projector), or the like. The input device is in communication with the processor and may receive user input in a variety of ways. For example, the input device may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.
The technical solutions involved in the following embodiments can be implemented in the server 300 having the above hardware architecture.
As shown in fig. 9, a schematic flow chart of a learning method of an intention recognition model provided in the embodiment of the present application specifically includes:
s101, the server receives forward data corresponding to the first skill input by the skill developer.
The skill developer may be a developer of a third-party application, or may also be a manufacturer of the electronic device, and the embodiment of the present application is not limited.
In some embodiments of the present application, the server provides a platform for skill developers. A skill developer may log in to the platform through a telecommunication network (a communication network such as 3G/4G/5G) or a WIFI network, register an account, and start to create a new voice skill, that is, the first skill. The skill developer may input basic information such as the name of the first skill and the classification of the first skill, then input the intents included in the first skill and the forward data corresponding to each intent. The forward data corresponding to each intent comprises: data that, when received, requires the server to perform a corresponding operation in response, such as: triggering a corresponding service, or returning a corresponding dialog to the electronic device, etc. That is, the user-spoken data that can trigger the server to perform the corresponding operation is called forward data; it may also be called forward corpus, forward example data, or in-skill data.
In other embodiments of the present application, the skill developer may also input some negative data corresponding to each intention when inputting positive data corresponding to each intention of the first skill. Where negative-going data is such data that is similar to, but not positive-going data. The negative data may also be referred to as negative corpora, negative examples, negative example data, extra-skill data, and the like, and the embodiment of the present application is not limited.
For example: the skill developer is a food-delivery application, and the first skill developed by the food-delivery application is a meal-ordering service. The application may input forward data corresponding to the ordering service, for example: "order takeout", "order a meal", etc. The application may also input negative data, such as: "book a flight ticket", etc.
S102, the server generates negative data corresponding to the first skill according to the positive data corresponding to the first skill.
Wherein the first skill comprises a plurality of intents, each of which may correspond to one or more forward data. For each intent, the server may generate corresponding negative-going data from the positive-going data for each intent. Thus, the accuracy of the intention recognition model trained later is improved.
This is because, in the prior art, only one type of training data (forward data) is input to the server. When the intention recognition model trained by the server judges a user utterance, the only judgment standard is a threshold set by the skill developer: when the confidence of an utterance is lower than the threshold, the utterance is considered not to be an intent in the first skill; when the confidence of the utterance reaches the threshold, the utterance is considered to be an intent within the first skill. In practical scenarios, the threshold is difficult to set for different voice skills, because it depends on factors such as the similarity between different voice skills. In the embodiments of the present application, other types of training data are introduced, such as the negative data described above, on which category judgment can also be performed, so that some misrecognitions can be avoided and the accuracy of the trained intention recognition model can be improved.
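The contrast between threshold-only judgment and judgment with an explicit negative class can be sketched as follows (the confidence values, class names, and threshold are hypothetical):

```python
# Illustrative contrast: with only positive training data, intent recognition
# reduces to a confidence threshold; with negative data the model can
# explicitly reject out-of-skill utterances. All numbers are invented.

THRESHOLD = 0.6  # hypothetical developer-set threshold

def threshold_only(confidence):
    # Only criterion: does the confidence reach the developer-set threshold?
    return "in-skill" if confidence >= THRESHOLD else "rejected"

def with_negative_class(scores):
    # scores maps each class (the skill's intents plus an explicit
    # "negative" class) to a confidence; reject when "negative" wins.
    label = max(scores, key=scores.get)
    return "rejected" if label == "negative" else label

# A near-miss utterance (e.g. the "open close" combination from the text)
# scores 0.65 against a similar intent:
print(threshold_only(0.65))  # misrecognized as "in-skill"
print(with_negative_class({"open_menu": 0.30, "negative": 0.55}))  # "rejected"
```

The second function shows why adding negative data removes the dependence on a hard-to-tune, per-skill threshold.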
For example, the present application provides two methods for generating negative data corresponding to a first skill according to positive data corresponding to the first skill, as follows:
according to the first method, aiming at each intention in the first skill, the server respectively extracts keywords of each intention from positive data corresponding to each intention, and combines the extracted keywords with other irrelevant words to form negative data corresponding to each intention.
The above-described keywords are the salient features that most easily affect the feature weights of classification (intent recognition). In the prior art, misrecognition occurs because the salient features of other classes (other intents) highly overlap with the salient features of a given class (a given intent). For example: intent 1 is to open the app and intent 2 is to close the app. "app" appears with high frequency in both intent 1 and intent 2, but is not a salient feature of either classification; if such a high-frequency feature is still relied on during classification, misrecognition easily occurs. To this end, the embodiments of the present application improve accuracy by reducing, during classification, the weight of the salient features shared by these classes (intents).
The keywords in the forward data may be extracted manually by the skill developer, or by using related keyword mining techniques. For example, a keyword in the forward data is also a feature of the forward data, so a feature selection method, such as an LR (Logistic Regression) method based on L1 regularization, may be used for extraction. The embodiments of the present application may also adopt a wrapper feature selection method, for example: LVW (Las Vegas Wrapper), sparse representation methods based on dictionary learning, simple heuristic algorithms, and the like. The embodiment of the present application does not limit the specific method for extracting the keywords.
For example, taking the LR method based on L1 regularization as an example, the process of extracting keywords from the forward data is briefly described. Assuming that the forward data corresponds to K categories (i.e., the forward data input by the user corresponds to K intents, where one category can be understood as one intent), K "1 vs the rest" classifiers are constructed. Features with little influence on the classification weights can be eliminated through the L1 norm, so as to retain the features that influence the classification. Considering that each classifier is a binary classification and the features are 0/1 discrete features, the salient features, namely the keywords with the largest weight for each intent, can be obtained by screening and sorting according to the weight signs and absolute values.
An irrelevant word can be a word (or word combination or sentence, etc.) that, after being combined with an extracted keyword, is similar in form to the original forward data but different in semantics, or a word (or word combination or sentence, etc.) that is unrelated to the semantics of the original forward data. For example, a sentence-similarity method may be used to determine whether the combined word (or word combination or sentence, etc.) is semantically different from, or unrelated to, the original forward data.
For example, the server may randomly combine keywords corresponding to different intents to generate negative data. The server may also randomly combine the keywords of each intent with some words that are not relevant to the first skill, such as entity words, to generate negative data. The embodiment of the application does not limit the method by which the server selects irrelevant words, nor the way in which the keyword of each intent is combined with other irrelevant words.
For example: the forward data of a certain skill comprises "open menu" and "close menu", corresponding to two intents, intent one and intent two. After extracting the keywords of the forward data, the keyword of intent one is "open" and the keyword of intent two is "close". Then, "open close" obtained by combining the keywords of different intents, and "open drawer", "close WeChat", etc., obtained by combining the keyword of each intent with other irrelevant words, constitute negative data of the skill.
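This combination step can be sketched as follows (the helper name `generate_negative_data`, the keyword table, and the irrelevant-word list are illustrative; a real system would also apply the sentence-similarity check described above):

```python
# Method one sketched: combine keywords of different intents with each other
# and with irrelevant words to form candidate negative data, then drop any
# candidate that is actually existing forward data.
from itertools import product

forward_data = {"intent_open": ["open menu"], "intent_close": ["close menu"]}
keywords = {"intent_open": "open", "intent_close": "close"}
irrelevant_words = ["drawer", "wechat"]  # illustrative entity words

def generate_negative_data():
    candidates = set()
    # Keywords of different intents combined with each other: "open close"
    for a, b in product(keywords.values(), repeat=2):
        if a != b:
            candidates.add(f"{a} {b}")
    # Keyword of each intent combined with irrelevant words: "open drawer", ...
    for kw, w in product(keywords.values(), irrelevant_words):
        candidates.add(f"{kw} {w}")
    # Exclude anything that is already forward data
    all_forward = {u for us in forward_data.values() for u in us}
    return sorted(candidates - all_forward)

print(generate_negative_data())
# ['close drawer', 'close open', 'close wechat',
#  'open close', 'open drawer', 'open wechat']
```

The output mirrors the text's examples: keyword-keyword combinations ("open close") and keyword-plus-irrelevant-word combinations ("open drawer", "close wechat").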
In the second method, the server in the embodiment of the application stores a large amount of training data of voice skills, including data of different voice skills of the same third-party application and data of different voice skills of different third-party applications. These training data may belong to different domains and different classifications, and constitute a full set of training data on the server. For a first skill, the server may determine a first set from the full set of training data stored on the server according to the classification of the first skill, where the corpora in the first set are corpora that may conflict with the forward data of the first skill. In determining the first set, an intra-class classifier-based approach or a similarity-search-based approach may be employed, for example: determining the corpora that conflict with the forward data, or determining suspicious corpora, by inverted index, sentence similarity calculation, and the like.
It should be noted that, when determining the first set, the server may directly determine the first set from the full set of training data, or may gradually determine the first set from the full set of training data, that is, the server may determine a set a from the full set of training data, where the set a is smaller than the full set of training data. And determining a set B in the set A, wherein the set B is smaller than the set A, and so on until the first set is determined. The embodiment of the present application does not limit the specific method for determining the first set by the server.
The corpus in the first set may be a corpus similar to the forward data, for example, a corpus containing keywords same as or similar to the forward data, or a corpus containing entity words (e.g., nouns) same as or similar to the forward data. For example: the forward data is "open menu", and the keyword of the corpus is assumed to be "open". Then, the first set may include, for example: "turn on radio", "turn on menu", "hide menu", etc.
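Determining the first set by similarity might look like the following sketch, using Jaccard word overlap as a simple stand-in for the sentence-similarity calculation mentioned above (the corpora and threshold are illustrative; note that in the original Chinese, "open" and "turn on" are the same word, so such matches would score higher than this English toy corpus suggests):

```python
# Retrieve the first set: corpora from the full training set whose word
# overlap with any forward datum reaches a threshold. Jaccard similarity
# is an illustrative stand-in for the patent's similarity search.

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

full_corpus = ["turn on radio", "turn on menu", "hide menu",
               "book a flight", "play music"]
forward_data = ["open menu"]

def first_set(corpus, forward, threshold=0.2):
    # Keep corpora similar enough to at least one forward datum
    return [c for c in corpus
            if any(jaccard(c, f) >= threshold for f in forward)]

print(first_set(full_corpus, forward_data))
# ['turn on menu', 'hide menu']
```

These retrieved corpora are exactly the "suspicious" candidates that the subsequent sampling and labeling steps decide between positive and negative data.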
The server may sample a certain number of corpora from the first set, and further determine whether the sampled corpora are negative data corresponding to the first skill.
The above-mentioned certain number (i.e. the number of the sampled corpora) may be determined according to the number of the forward data. For example: the number of the sampled corpora may be determined to be the same as the number of the forward data, or may be the maximum corpus number of the forward data corresponding to each intention, or the average corpus number, or a preset number, and the like.
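The sampling-size options listed above can be sketched as one small function (the strategy names and counts are illustrative):

```python
# Choose how many corpora to sample from the first set, given the number of
# forward corpora per intent. Strategy names below are illustrative labels
# for the options described in the text.

def sample_count(forward_counts, strategy="same", preset=50):
    # forward_counts: number of forward corpora for each intent
    if strategy == "same":     # same as the total amount of forward data
        return sum(forward_counts)
    if strategy == "max":      # maximum corpus count over the intents
        return max(forward_counts)
    if strategy == "average":  # average corpus count over the intents
        return round(sum(forward_counts) / len(forward_counts))
    return preset              # a preset number

counts = [10, 30, 20]  # three intents with 10, 30, and 20 forward corpora
print(sample_count(counts, "same"))     # 60
print(sample_count(counts, "max"))      # 30
print(sample_count(counts, "average"))  # 20
print(sample_count(counts, "preset"))   # 50
```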
In the sampling process, a random sampling mode or a sampling mode according to theme distribution and the like can be adopted, and the specific sampling mode is not limited in the embodiment of the application.
The server determines whether a sampled corpus is negative data of the first skill; this may also be understood as classifying the sampled corpus: if its semantics are the same as the positive data, it is positive data, and if its semantics differ from the positive data, it is negative data. Whether a sampled corpus is negative data of the first skill may be determined by manual definition, or by a clustering algorithm combined with manual iteration. The clustering algorithm may be hierarchical clustering, a text clustering classification method based on a Topic Model, or the like. Combining a clustering algorithm with manual iteration may mean, for example: manually labeling features that the clustering algorithm cannot identify (for example, the vividness of a color), and manually determining whether the suspicious corpora that the clustering algorithm cannot decide on are negative data. The clustering algorithm can also learn the rules of the manual labeling, and then use those rules to determine whether the remaining sampled corpora are negative data.
When the server generates negative data from the positive data, the server may adopt any one of the above methods, or use both methods, or use other methods, and the embodiments of the present application are not limited to the method for generating negative data from the positive data.
S103, the server determines a second skill similar to the first skill, wherein the second skill is at least one.
It should be noted that, considering that a large amount of data of different voice skills is stored on the server, and that there may be similarities among these voice skills, the server may share data between different voice skills. Specifically, the server may share the data of each voice skill after obtaining the consent of each voice skill developer. The sharing may be based on classification, for example: data may be shared between different voice skills in the same category, and not shared between voice skills in different categories. In some embodiments, the server may add a sharing layer between voice skills that may share data, through which different voice skills may obtain the data of other voice skills. In other embodiments, the server may further have some built-in data that can be used by different voice skills, so that when the server learns the intention recognition model corresponding to a new voice skill, if the voice skill developer does not input training data corresponding to the voice skill, or inputs little training data, the server can still learn the intention recognition model from the shared data of other voice skills and the data built into the server.
To this end, in steps S103 and S104, the server may determine other, second skills similar to the first skill, for example according to the classification of the first skill, or the similarity between the training data of the first skill and the training data of other skills. The second skill may be another voice skill developed by the developer of the first skill, or a voice skill developed by another developer. The embodiment of the present application does not limit the selection method or the number of second skills.
And S104, the server acquires data corresponding to each second skill.
The server may obtain training data corresponding to each second skill, for example, through the sharing layer. The data may include positive data and negative data; specifically, it may include the positive and negative data input by the developer of the second skill, or the negative data automatically generated by the server in the process of learning the intention recognition model or other models corresponding to the second skill.
And S105, the server generates a second base model according to the data corresponding to the second skill and the first base model stored by the server.
When the server learns the intention recognition models corresponding to the skills, the preset model learning framework comprises one or more used first base models, organization relations (which may include weights and hierarchical relations among the first base models) among the first base models, and the like.
In the embodiment of the present application, the server inputs the corpora corresponding to each second skill into the first base models respectively, and learns a plurality of second base models from them. In some embodiments, a multi-task learning method may be adopted, for example; the multi-task learning method may be implemented based on models such as a deep-learning Convolutional Neural Network (CNN) or a bidirectional Long Short-Term Memory network (BiLSTM). In other embodiments, adversarial training may be added to the multi-task learning process, so that the shared data can be further purified, which improves the accuracy of the second base models and, in turn, the accuracy of the intention recognition model.
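As a hedged illustration of the hard-parameter-sharing idea behind such multi-task learning, the toy numpy sketch below trains one shared layer jointly with two task-specific heads; the shared layer plays the role of the first base model, and each (shared layer + head) pair plays the role of a second base model. All data, dimensions, and names are invented stand-ins for the CNN/BiLSTM models mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def log_loss(p, y):
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# toy "utterance feature" corpora for two similar second skills
X2, y2 = rng.normal(size=(20, 4)), rng.integers(0, 2, 20).astype(float)  # corpus C
X3, y3 = rng.normal(size=(20, 4)), rng.integers(0, 2, 20).astype(float)  # corpus D

W_shared = rng.normal(scale=0.1, size=(4, 3))       # shared layer (first base model analogue)
heads = {"skill2": rng.normal(scale=0.1, size=3),   # task-specific heads: each
         "skill3": rng.normal(scale=0.1, size=3)}   # (shared layer + head) is a "second base model"

def total_loss():
    return sum(log_loss(sigmoid((X @ W_shared) @ heads[k]), y)
               for X, y, k in [(X2, y2, "skill2"), (X3, y3, "skill3")])

loss_before = total_loss()
lr = 0.1
for _ in range(200):                        # alternate gradient steps over both tasks
    for X, y, k in [(X2, y2, "skill2"), (X3, y3, "skill3")]:
        h = X @ W_shared
        err = sigmoid(h @ heads[k]) - y
        g_head = h.T @ err / len(y)
        g_shared = X.T @ np.outer(err, heads[k]) / len(y)
        heads[k] -= lr * g_head             # task-specific update
        W_shared -= lr * g_shared           # shared update benefits both tasks

print(total_loss() < loss_before)           # joint training reduced the combined loss
```

The point of the sketch is only the parameter sharing: gradients from both skills' corpora update the same shared layer, which is what lets the skills benefit from each other's data.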
For example: as shown in fig. 6, the first skill is a speech skill 1, and the input training data of the first skill is corpus a. Assume that the server determines that voice skills that are similar to voice skill 1 are voice skill 2 and voice skill 3. The training data corresponding to the voice skill 2 is corpus C, and the training data corresponding to the voice skill 3 is corpus D.
The server inputs corpus C into model learning framework 2 for learning, where model learning framework 2 includes one or more first base models (base model 1). Model learning framework 2 may be, for example, a multi-task learning framework; specifically, corpus C is input into the multi-task learning framework to generate a plurality of second base models (a plurality of base models 2). This is equivalent to the server performing a learning task for voice skill 2. Corpus D is likewise input into the multi-task learning framework to generate a plurality of second base models, which is equivalent to the server performing a learning task for voice skill 3.
That is, when learning the intention recognition model corresponding to voice skill 1, the server also introduces the training data of other voice skills (voice skill 2 or voice skill 3) and performs multi-task joint training on that data, thereby generating second base models in different representation spaces. The second base models in the different representation spaces are then input into the ensemble learning framework for learning, finally yielding the intention recognition model. Compared with the prior-art approaches of simply performing feature transformation on the training data or replacing the first base model, this is more conducive to improving the accuracy of the learned intention recognition model corresponding to the voice skill.
S106. The server learns according to the positive data and negative data corresponding to the first skill, and the second base models. If it is determined according to the policy that the fastest intention recognition model is to be returned, step S107 is performed; if it is determined according to the policy that the best intention recognition model is to be returned, step S108 is performed.
The policy may be preset on the server, that is, it specifies which model is returned (or whether both models are returned). It may also be a policy recommended by the server to the user, that is, different models may be recommended to the user in different situations. The server may further receive a return policy input by the user and determine which model to return accordingly, which is not limited in the embodiment of the present application.
In some embodiments of the present application, the server may use an ensemble learning method for learning, for example, a stacking ensemble learning method. During learning, the server successively generates a number of intermediate models.
For example, fig. 10 is a schematic diagram of another process of learning an intention recognition model provided in the embodiment of the present application. When model learning framework 3 adopts a stacking ensemble learning framework, the server inputs the positive data and negative data corresponding to the first skill into the stacking ensemble learning framework and learns the intermediate model of the first layer, denoted intermediate model 1. Then, the server inputs the features obtained by the first-layer intermediate model into the stacking ensemble learning framework and learns the intermediate model of the second layer, and so on, until the final intention recognition model is generated.
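The layer-by-layer stacking process can be sketched with toy logistic models standing in for the real base and intermediate models (the data is synthetic, and the feature subsets are arbitrary illustration choices, not part of the embodiment):

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic "utterance features" and intent labels for the first skill
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # 1 = intention of the first skill

def fit_logreg(X, y, steps=300, lr=0.5):
    # plain batch gradient descent on logistic loss (no intercept)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(w, X):
    return 1 / (1 + np.exp(-(X @ w)))

# layer 1: several base models on different feature views ("intermediate model 1")
subsets = [[0, 1], [1, 2, 3], [0, 3, 4]]
layer1 = [fit_logreg(X[:, s], y) for s in subsets]

# layer 2: the stacker learns from the layer-1 outputs ("intermediate model 2");
# a constant column gives the stacker a bias term
meta = np.column_stack([predict(w, X[:, s]) for w, s in zip(layer1, subsets)]
                       + [np.ones(len(y))])
stacker = fit_logreg(meta, y)

train_acc = float(np.mean((predict(stacker, meta) > 0.5) == y))
print(round(train_acc, 2))
```

Each layer here could itself be returned as a working model, which is exactly why the intermediate models of steps S107/S108 below can serve as stand-alone intention recognition models.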
S107. The server determines the fastest intention recognition model.
In the learning process of step S106, the server may successively generate a plurality of intermediate models, and each intermediate model may independently implement the function of the intention recognition model corresponding to the first skill. Thus, based on the needs of the skill developer, the server may first return an intermediate model as the fastest generated model. In some embodiments, the server may determine and return the first generated intermediate model as the fastest intention recognition model. As shown in fig. 10, the server may determine the first generated intermediate model (intermediate model 1) as the fastest intention recognition model. In other embodiments, the server may determine and return a model generated using a specified method as the fastest intention recognition model. In this way, the skill developer can obtain an intention recognition model as quickly as possible, in order to understand its function, performance, and the like.
S108. The server determines the best intention recognition model.
Generally, after the server finishes learning according to the positive data and negative data corresponding to the first skill and the second base models, the finally obtained intention recognition model is the best intention recognition model. This is because, during the learning process, the server performs model selection and parameter tuning based on the successively generated intermediate models, so that the finally obtained intention recognition model has the highest accuracy. For example, a grid search method, a Bayesian optimization method, or the like may be used for parameter tuning.
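For illustration only, a grid search over two hypothetical tuning knobs can be sketched as follows. The scoring function is an invented stand-in for the validation accuracy of a candidate intermediate model, not part of the embodiment:

```python
import itertools

def validation_score(lr, depth):
    # stand-in for validation accuracy of a candidate model configuration;
    # peaks at lr = 0.1, depth = 3 purely for demonstration
    return 1.0 - abs(lr - 0.1) - 0.05 * abs(depth - 3)

# grid search: evaluate every combination and keep the best
grid = {"lr": [0.01, 0.1, 0.5], "depth": [2, 3, 4]}
best = max(itertools.product(grid["lr"], grid["depth"]),
           key=lambda combo: validation_score(*combo))
print(best)  # → (0.1, 3)
```

Bayesian optimization replaces the exhaustive enumeration with a model-guided search over the same space, which matters when each evaluation (training an intermediate model) is expensive.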
It should be noted that parameter tuning in the prior art is based on training data corresponding to the first skill input by the first skill developer, so that the accuracy of the finally obtained intention recognition model also reflects the accuracy of prediction performed by using data in the first skill.
However, in the embodiment of the application, the server uses data outside the first skill (for example, automatically generated negative data of the first skill, and uses data of a second skill similar to the first skill when the basic model is expanded) in addition to the data inside the first skill (for example, positive data of the first skill input by the developer of the first skill) when training the intention recognition model.
Thus, as shown in FIG. 10, when selecting the best intention recognition model, the server needs to consider both data within the first skill (e.g., the positive data of the first skill) and data outside the first skill (e.g., negative data of the first skill automatically generated by the server). The server therefore needs to consider two kinds of accuracy. The first is the accuracy of prediction on data within the first skill, that is, the accuracy on the first skill's data: such a prediction is correct when the input is identified as positive (i.e., as an intention that can invoke the first skill). The second is the accuracy of prediction on data outside the first skill: such a prediction is correct when the input is identified as negative (i.e., as an intention that cannot invoke the first skill).
In some embodiments, the false recall rate of data outside the first skill may also be used to reflect the accuracy of prediction on such data: the lower the false recall rate, the higher the accuracy. It can be seen that the accuracy on data within the first skill tends to be inversely related to the accuracy on data outside the first skill, and directly related to the false recall rate of data outside the first skill. That is, when the accuracy on data within the first skill is high, the false recall rate of data outside the first skill is also high. The finally trained intention recognition model therefore cannot be optimal in both respects at once, and for this reason the embodiment of the present application provides a method for evaluating the best intention recognition model.
The server may set a confidence level for the first skill, which may be set by a user (e.g., the developer of the first skill). The confidence parameter reflects the user's relative requirements on two indicators: the accuracy on data within the first skill and the false recall rate of data outside the first skill. For example, the greater the confidence, the lower the false recall rate the user expects for data outside the first skill, that is, the higher the desired accuracy of prediction on data outside the first skill. Thus, the best intention recognition model can be evaluated using Formula 1 below:
score = accuracyIn × (1 − C) + accuracyOut × C (Formula 1)
where score is the score for evaluating the best intention recognition model (the higher the score, the higher the accuracy of the intention recognition model and the better it meets the user's requirements); accuracyIn is the average accuracy on data within the first skill; accuracyOut is the average accuracy on data outside the first skill; and C is the confidence value set by the user, with 0 ≤ C ≤ 1.
The server may employ, for example, a K-fold cross validation method when calculating the average accuracy of the data within the first skill and the average accuracy of the data outside the first skill. The specific method of K-fold cross validation may refer to the prior art, and is not described herein again.
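The fold-splitting step of K-fold cross-validation can be sketched as follows (an index-only illustration; the full validation procedure follows the prior art as noted above):

```python
def k_fold_indices(n, k):
    # split n sample indices into k roughly equal, disjoint validation folds;
    # the first n % k folds get one extra sample
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

print(k_fold_indices(10, 3))  # → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each fold in turn serves as the held-out set while the rest is used for training; averaging the k held-out accuracies gives the average accuracy used in the score formulas.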
In other embodiments of the present application, an unreasonable confidence setting by the user may greatly affect the accuracy of the trained intention recognition model. For example, when the confidence set by the user is too high, the accuracy on data within the skill is ignored. To avoid this, the server may also set a parameter that controls the influence of the confidence set by the user. The best intention recognition model can then be evaluated using Formula 2 below:
score = accuracyIn × (1 − C × P) + accuracyOut × C × P (Formula 2)
where the parameters score, accuracyIn, accuracyOut, and C have the same meanings as in Formula 1 and are not described again. P is a parameter set by the server, with 0 ≤ P ≤ 1. The value of P controls how much the confidence set by the user influences score: the larger P is, the greater the influence the user-set confidence is allowed to have on score; the smaller P is, the smaller that influence.
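Formulas 1 and 2 can be expressed as a single scoring function, since Formula 1 is the special case P = 1; the numeric values in the usage lines below are illustrative only:

```python
def model_score(accuracy_in, accuracy_out, c, p=1.0):
    """Score a candidate intention recognition model (Formulas 1 and 2).

    accuracy_in:  average accuracy on data within the first skill
    accuracy_out: average accuracy on data outside the first skill
    c: confidence set by the user, 0 <= c <= 1
    p: server-set control parameter, 0 <= p <= 1 (p = 1 gives Formula 1)
    """
    return accuracy_in * (1 - c * p) + accuracy_out * c * p

print(round(model_score(0.9, 0.6, c=0.0), 2))         # 0.9: only in-skill accuracy counts
print(round(model_score(0.9, 0.6, c=1.0), 2))         # 0.6: only out-of-skill accuracy counts
print(round(model_score(0.9, 0.6, c=1.0, p=0.5), 2))  # 0.75: the server damps the user's confidence
```

The last line shows P at work: even with the user's confidence at its maximum, the server-set P = 0.5 keeps half the weight on in-skill accuracy.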
It should be noted that, in the process of executing steps S107 and S108, the server may share some common features of the intermediate models through the sharing layer, so as to reduce the amount of computation and improve the efficiency of model training.
Then, after the fastest or the best intention recognition model is learned, it can be used to predict a new user utterance input by a user of the electronic device and determine the user's intention, so as to provide the corresponding service to the user.
For example, in the human-machine dialog system shown in fig. 1, the learned fastest or best intention recognition model may be deployed on the server 300, or on the server 200. When deployed on server 300, server 300 may store intention recognition models corresponding to the different voice skills of a plurality of third-party applications (e.g., third-party application 1 and third-party application 2). When deployed on server 200, server 200 may store intention recognition models corresponding to the different voice skills of a third-party application.
As another example, in the human-machine dialog system shown in fig. 4, the learned fastest or best intention recognition model may be deployed on the server 200, and the server 200 may store intention recognition models corresponding to the different voice skills of third-party application 1.
The application of the intention recognition model can refer to the foregoing description and will not be described in detail.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed to a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributed to by the prior art, or all or part of the technical solutions may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

  1. A method of learning an intent recognition model, the method comprising:
    the server receives positive data corresponding to each intention in the first skill input by the skill developer;
    the server generates negative data corresponding to each intention in the first skill according to the positive data corresponding to each intention in the first skill;
    the server acquires training data corresponding to a second skill, wherein the second skill is a skill similar to the first skill, and the number of the second skill is at least one;
    the server learns according to training data corresponding to the second skill and a preset first base model to generate at least one second base model;
    and the server learns according to the positive data corresponding to each intention in the first skill, the negative data corresponding to each intention in the first skill and the second base model to generate an intention identification model.
  2. The learning method of the intent recognition model according to claim 1, wherein the server generating negative data corresponding to each intent in the first skill from the positive data corresponding to each intent in the first skill comprises:
    the server respectively extracts keywords corresponding to each intention in the first skill aiming at each intention in the first skill, wherein the keywords are key features influencing the weight of the first base model;
    combining keywords corresponding to different intentions in the first skill, or combining keywords corresponding to different intentions with words related to the non-first skill, and determining the combined words as negative data corresponding to the different intentions.
  3. The learning method of the intention recognition model according to claim 1 or 2, wherein the server generating negative data corresponding to each intention in the first skill from the positive data corresponding to each intention in the first skill further comprises:
    the server determines a first set from a full set of training data stored on the server according to the classification of the first skill, wherein the training data in the first set comprises negative data corresponding to each intention in the first skill; wherein the training data comprises positive data and negative data;
    sampling a preset amount of training data from the first set;
    and determining negative data corresponding to each intention in the first skill from the training data obtained by sampling by adopting a manual labeling and/or clustering algorithm.
  4. The learning method of the intention recognition model according to any one of claims 1 to 3, wherein the server acquiring data corresponding to the second skill comprises:
    the server determining the second skill according to the classification of the first skill and/or the positive data of the first skill;
    and acquiring training data corresponding to the second skill through a sharing layer.
  5. The learning method of the intention recognition model according to any one of claims 1 to 4, wherein the server learns the training data corresponding to the second skill and a preset first base model, and generating the second base model comprises:
    and the server learns by adopting a multi-task learning method according to the training data corresponding to the second skill and a preset first base model to generate a second base model.
  6. The method for learning intent recognition model according to any one of claims 1-5, wherein the server learns from the positive data corresponding to each of the first skills and the negative data corresponding to each of the first skills and the second base model, and the generating the intent recognition model comprises:
    and the server learns by adopting an ensemble learning method according to the positive data corresponding to each intention in the first skill, the negative data corresponding to each intention in the first skill and the second base model to generate an intention identification model.
  7. The learning method of an intention recognition model according to claim 6, wherein the server generating the intention recognition model specifically includes:
    the server generates a plurality of intermediate models in the process of generating the intention recognition model;
    the server determines a first generated intermediate model as a fastest intention recognition model; and/or after the server selects the models and adjusts the parameters of the plurality of intermediate models, determining the optimal intention recognition model.
  8. The learning method of the intention recognition model according to claim 7, wherein the server determining the best intention recognition model comprises:
    the server calculates a first accuracy rate of each intermediate model according to positive data corresponding to each intention in the first skill; the server calculates a second accuracy rate of each intermediate model according to negative data corresponding to each intention in the first skill;
    and the server determines the optimal intention recognition model after performing model selection and parameter adjustment on the plurality of intermediate models according to the first accuracy, the second accuracy and the weight input by the skill developer.
  9. A server, comprising: a processor, a memory, and a communication interface; the memory is configured to store computer program code comprising computer instructions that, when read from the memory by the processor, cause the server to:
    receiving positive data corresponding to each intention in the first skill input by the skill developer through the communication interface;
    generating negative data corresponding to each intention in the first skill according to the positive data corresponding to each intention in the first skill;
    acquiring training data corresponding to a second skill, wherein the second skill is similar to the first skill, and the number of the second skill is at least one;
    learning according to training data corresponding to the second skill and a preset first base model to generate at least one second base model;
    and learning according to the positive data corresponding to each intention in the first skill, the negative data corresponding to each intention in the first skill and the second basic model to generate an intention identification model.
  10. The server according to claim 9, wherein the server generating negative data corresponding to each intention in the first skill from the positive data corresponding to each intention in the first skill specifically comprises:
    the server respectively extracts keywords corresponding to each intention in the first skill aiming at each intention in the first skill, wherein the keywords are key features influencing the weight of the first base model;
    combining keywords corresponding to different intentions in the first skill, or combining keywords corresponding to different intentions with words related to the non-first skill, and determining the combined words as negative data corresponding to the different intentions.
  11. The server according to claim 9 or 10, wherein the server generating negative data corresponding to each intention in the first skill from the positive data corresponding to each intention in the first skill further comprises:
    the server determines a first set from a full set of training data stored on the server according to the classification of the first skill, wherein the training data in the first set comprises negative data corresponding to each intention in the first skill; wherein the training data comprises positive data and negative data;
    sampling a preset amount of training data from the first set;
    and determining negative data corresponding to each intention in the first skill from the training data obtained by sampling by adopting a manual labeling and/or clustering algorithm.
  12. The server according to any one of claims 9-11, wherein the server obtaining data corresponding to a second skill comprises:
    the server determining the second skill according to the classification of the first skill and/or the positive data of the first skill;
    and acquiring training data corresponding to the second skill through a sharing layer.
  13. The server according to any one of claims 9 to 12, wherein the server learns according to training data corresponding to the second skill and a preset first base model, and generating the second base model comprises:
    and the server learns by adopting a multi-task learning method according to the training data corresponding to the second skill and a preset first base model to generate a second base model.
  14. The server according to any one of claims 9-13, wherein the server learns from the positive data corresponding to each of the intents in the first skill and the negative data corresponding to each of the intents in the first skill, and the second base model, and wherein generating the intent recognition model comprises:
    and the server learns by adopting an ensemble learning method according to the positive data corresponding to each intention in the first skill, the negative data corresponding to each intention in the first skill and the second base model to generate an intention identification model.
  15. The server of claim 14, wherein the server generating the intent recognition model further comprises:
    the server generates a plurality of intermediate models in the process of generating the intention recognition model;
    the server determines a first generated intermediate model as a fastest intention recognition model; and/or after the server selects the models and adjusts the parameters of the plurality of intermediate models, determining the optimal intention recognition model.
  16. The server of claim 15, wherein the server determines the best intent recognition model comprises:
    the server calculates a first accuracy rate of each intermediate model according to positive data corresponding to each intention in the first skill; the server calculates a second accuracy rate of each intermediate model according to negative data corresponding to each intention in the first skill;
    and the server determines the optimal intention recognition model after performing model selection and parameter adjustment on the plurality of intermediate models according to the first accuracy, the second accuracy and the weight input by the skill developer.
CN201880093483.0A 2018-09-19 2018-09-19 Method, device and equipment for learning intention recognition model Pending CN112154465A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/106468 WO2020056621A1 (en) 2018-09-19 2018-09-19 Learning method and apparatus for intention recognition model, and device

Publications (1)

Publication Number Publication Date
CN112154465A true CN112154465A (en) 2020-12-29

Family

ID=69888084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880093483.0A Pending CN112154465A (en) 2018-09-19 2018-09-19 Method, device and equipment for learning intention recognition model

Country Status (4)

Country Link
US (1) US20210350084A1 (en)
EP (1) EP3848855A4 (en)
CN (1) CN112154465A (en)
WO (1) WO2020056621A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344197A (en) * 2021-06-02 2021-09-03 北京三快在线科技有限公司 Training method of recognition model, service execution method and device
CN113515594A (en) * 2021-04-28 2021-10-19 京东数字科技控股股份有限公司 Intention recognition method, intention recognition model training method, device and equipment
CN113569581A (en) * 2021-08-26 2021-10-29 中国联合网络通信集团有限公司 Intention recognition method, device, equipment and storage medium

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
US11657797B2 (en) * 2019-04-26 2023-05-23 Oracle International Corporation Routing for chatbots
JP2022531994A (en) * 2019-05-02 2022-07-12 サピエントエックス インコーポレイテッド Generation and operation of artificial intelligence-based conversation systems
CN111611366B (en) * 2020-05-20 2023-08-11 北京百度网讯科技有限公司 Method, device, equipment and storage medium for optimizing intention recognition
CN112650834B (en) * 2020-12-25 2023-10-03 竹间智能科技(上海)有限公司 Intention model training method and device
US20220230000A1 (en) * 2021-01-20 2022-07-21 Oracle International Corporation Multi-factor modelling for natural language processing
CN113407698B (en) * 2021-06-30 2022-08-23 北京百度网讯科技有限公司 Method and device for training and recognizing intention of intention recognition model
US20230008868A1 (en) * 2021-07-08 2023-01-12 Nippon Telegraph And Telephone Corporation User authentication device, user authentication method, and user authentication computer program
CN113792655A (en) * 2021-09-14 2021-12-14 京东鲲鹏(江苏)科技有限公司 Intention identification method and device, electronic equipment and computer readable medium

Family Cites Families (16)

Publication number Priority date Publication date Assignee Title
CN102521599A (en) 2011-09-30 2012-06-27 中国科学院计算技术研究所 Mode training method based on ensemble learning and mode indentifying method
US20150052098A1 (en) * 2012-04-05 2015-02-19 Thomson Licensing Contextually propagating semantic knowledge over large datasets
JP6514503B2 (en) * 2014-12-25 2019-05-15 クラリオン株式会社 Intention estimation device and intention estimation system
US9761220B2 (en) * 2015-05-13 2017-09-12 Microsoft Technology Licensing, Llc Language modeling based on spoken and unspeakable corpuses
CN106407333B (en) * 2016-09-05 2020-03-03 北京百度网讯科技有限公司 Spoken language query identification method and device based on artificial intelligence
CN106446951A (en) * 2016-09-28 2017-02-22 中科院成都信息技术股份有限公司 Singular value selection-based integrated learning device
CN106528531B (en) * 2016-10-31 2019-09-03 北京百度网讯科技有限公司 Intention analysis method and device based on artificial intelligence
US10395141B2 (en) * 2017-03-20 2019-08-27 Sap Se Weight initialization for machine learning models
US10839154B2 (en) * 2017-05-10 2020-11-17 Oracle International Corporation Enabling chatbots by detecting and supporting affective argumentation
US10943583B1 (en) * 2017-07-20 2021-03-09 Amazon Technologies, Inc. Creation of language models for speech recognition
CN107491531B (en) 2017-08-18 2019-05-17 华南师范大学 Chinese network comment sensibility classification method based on integrated study frame
CN108172224B (en) * 2017-12-19 2019-08-27 浙江大学 Method based on the defence of machine learning without vocal command control voice assistant
CN108090520A (en) * 2018-01-08 2018-05-29 北京中关村科金技术有限公司 Training method, system, device and the readable storage medium storing program for executing of intention assessment model
US11315570B2 (en) * 2018-05-02 2022-04-26 Facebook Technologies, Llc Machine learning-based speech-to-text transcription cloud intermediary
US10832003B2 (en) * 2018-08-26 2020-11-10 CloudMinds Technology, Inc. Method and system for intent classification
US20200082272A1 (en) * 2018-09-11 2020-03-12 International Business Machines Corporation Enhancing Data Privacy in Remote Deep Learning Services

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515594A (en) * 2021-04-28 2021-10-19 JD Digital Technology Holdings Co., Ltd. Intention recognition method, intention recognition model training method, device and equipment
CN113344197A (en) * 2021-06-02 2021-09-03 Beijing Sankuai Online Technology Co., Ltd. Training method of recognition model, service execution method and device
CN113569581A (en) * 2021-08-26 2021-10-29 China United Network Communications Group Co., Ltd. Intention recognition method, device, equipment and storage medium
CN113569581B (en) * 2021-08-26 2023-10-17 China United Network Communications Group Co., Ltd. Intention recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2020056621A1 (en) 2020-03-26
EP3848855A1 (en) 2021-07-14
US20210350084A1 (en) 2021-11-11
EP3848855A4 (en) 2021-09-22

Similar Documents

Publication Publication Date Title
CN112154465A (en) Method, device and equipment for learning intention recognition model
US10540965B2 (en) Semantic re-ranking of NLU results in conversational dialogue applications
EP3513324B1 (en) Computerized natural language query intent dispatching
US9269354B2 (en) Semantic re-ranking of NLU results in conversational dialogue applications
US9171542B2 (en) Anaphora resolution using linguistic cues, dialogue context, and general knowledge
US9361884B2 (en) Communicating context across different components of multi-modal dialog applications
US11568853B2 (en) Voice recognition method using artificial intelligence and apparatus thereof
CN116583837A (en) Distance-based logit values for natural language processing
CN116547676A (en) Enhanced logits for natural language processing
EP4060971A1 (en) Generating action items during a conferencing session
US20230169405A1 (en) Updating training examples for artificial intelligence
CN109002498B (en) Man-machine conversation method, device, equipment and storage medium
CN114860910A (en) Intelligent dialogue method and system
KR20220040997A (en) Electronic apparatus and control method thereof
KR102631143B1 (en) Voice synthesizer using artificial intelligence, operating method of voice synthesizer and computer redable recording medium
CN115427993A (en) Updating constraints of computerized assistant operations
RU2818036C1 (en) Method and system for controlling dialogue agent in user interaction channel
CN117094387B (en) Knowledge graph construction method and system based on big data
CN117725185B (en) Intelligent dialogue generation method and system
EP3161666A1 (en) Semantic re-ranking of nlu results in conversational dialogue applications
CN114218939A (en) Text word segmentation method, device, equipment and storage medium
WO2022131954A1 (en) Dialogue control method and system for understanding natural language in a virtual assistant platform
CN114492447A (en) Text data processing method and intelligent device
CN113111652A (en) Data processing method and device and computing equipment
WO2021096382A2 (en) Method and system of controlling a conversation agent in a user interface channel

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination